Preface


These days, Python is undoubtedly one of the major strategic technology platforms in 
the financial industry. When I started writing the first edition of this book in 2013, I 
still had many conversations and presentations in which I argued relentlessly for 
Python’s competitive advantages in finance over other languages and platforms. 
Toward the end of 2018, this is not a question anymore: financial institutions around 
the world now simply try to make the best use of Python and its powerful ecosystem 
of data analysis, visualization, and machine learning packages. 


Beyond the realm of finance, Python is also often the language of choice in introduc- 
tory programming courses, such as in computer science programs. Beyond its reada- 
ble syntax and multiparadigm approach, a major reason for this is that Python has 
also become a first class citizen in the areas of artificial intelligence (AI), machine 
learning (ML), and deep learning (DL). Many of the most popular packages and 
libraries in these areas are either written directly in Python (such as scikit-learn for 
ML) or have Python wrappers available (such as TensorFlow for DL). 


Finance itself is entering a new era, and two major forces are driving this evolution. 
The first is the programmatic access to basically all the financial data available—in 
general, this happens in real time and is what leads to data-driven finance. Decades 
ago, most trading or investment decisions were driven by what traders and portfolio 
managers could read in the newspaper or learn through personal conversations. Then 
came terminals that brought financial data in real time to the traders’ and portfolio 
managers’ desks via computers and electronic communication. Today, individuals 
(or teams) can no longer keep up with the vast amounts of financial data generated in 
even a single minute. Only machines, with their ever-increasing processing speeds 
and computational power, can keep up with the volume and velocity of financial 
data. This means, among other things, that most of today’s global equities trading 
volume is driven by algorithms and computers rather than by human traders. 


The second major force is the increasing importance of AI in finance. More and 
more financial institutions try to capitalize on ML and DL algorithms to improve 
operations and their trading and investment performance. At the beginning of 2018,
the first dedicated book on “financial machine learning” was published, which
underscores this trend. Without a doubt, there are more to come. This leads to what might
be called AI-first finance, where flexible, parameterizable ML and DL algorithms
replace traditional financial theory—theory that might be elegant but no longer very 
useful in the new era of data-driven, AI-first finance. 


Python is the right programming language and ecosystem to tackle the challenges of 
this era of finance. Although this book covers basic ML algorithms for unsupervised 
and supervised learning (as well as deep neural networks, for instance), the focus is 
on Python’s data processing and analysis capabilities. To fully account for the impor- 
tance of AI in finance—now and in the future—another book-length treatment is 
necessary. However, most of the AI, ML, and DL techniques require such large 
amounts of data that mastering data-driven finance should come first anyway. 


This second edition of Python for Finance is more of an upgrade than an update. For 
example, it adds a complete part (Part IV) about algorithmic trading. This topic has 
recently become quite important in the financial industry, and is also quite popular 
with retail traders. It also adds a more introductory part (Part II) where fundamental 
Python programming and data analysis topics are presented before they are applied 
in later parts of the book. On the other hand, some chapters from the first edition 
have been deleted completely. For instance, the chapter on web techniques and pack- 
ages (such as Flask) was dropped because there are more dedicated and focused 
books about such topics available today. 


For the second edition, I tried to cover even more finance-related topics and to focus 
on Python techniques that are particularly useful for financial data science, algorith- 
mic trading, and computational finance. As in the first edition, the approach is a 
practical one, in that implementation and illustration come before theoretical details 
and I generally focus on the big picture rather than the most arcane parameterization 
options of a certain class, method, or function. 


Having described the basic approach for the second edition, it is worth emphasizing 
that this book is neither an introduction to Python programming nor to finance in 
general. A vast number of excellent resources are available for both. This book is 
located at the intersection of these two exciting fields, and assumes that the reader 
has some background in programming (not necessarily Python) as well as in finance. 
Such readers learn how to apply Python and its ecosystem to the financial domain. 


The Jupyter Notebooks and code accompanying this book can be accessed and
executed via our Quant Platform. You can sign up for free at http://py4fi.pqp.io.


My company (The Python Quants) and I provide many more resources to master
Python for financial data science, artificial intelligence, algorithmic trading, and
computational finance. You can start by visiting the following sites:
• Our company website
• My private website
• Our Python books website
• Our online training website
• The Certificate Program website


From all the offerings that we have created over the last few years, I am most proud of 
our Certificate Program in Python for Algorithmic Trading. It provides over 150 hours 
of live and recorded instruction, over 1,200 pages of documentation, over 5,000 lines 
of Python code, and over 50 Jupyter Notebooks. The program is offered multiple 
times per year and we update and improve it with every cohort. The online program 
is the first of its kind, in that successful delegates obtain an official university certifi- 
cate in cooperation with htw saar University of Applied Sciences. 


In addition, I recently started The AI Machine, a new project and company to stand- 
ardize the deployment of automated, algorithmic trading strategies. With this project, 
we want to implement in a systematic and scalable fashion what we have been teach- 
ing over the years in the field, in order to capitalize on the many opportunities in the 
algorithmic trading field. Thanks to Python—and data-driven and AI-first finance— 
this project is possible these days even for a smaller team like ours. 


I closed the preface for the first edition with the following words: 


I am really excited that Python has established itself as an important technology in the
financial industry. I am also sure that it will play an even more important role there in
the future, in fields like derivatives and risk analytics or high performance computing.
My hope is that this book will help professionals, researchers, and students alike make
the most of Python when facing the challenges of this fascinating field.

When I wrote these lines in 2014, I couldn’t have predicted how important Python
would become in finance. In 2018, I am even happier that my expectations and hopes 
have been so greatly surpassed. Maybe the first edition of the book played a small 
part in this. In any case, a big thank you is in order to all the relentless open source 
developers out there, without whom the success story of Python couldn’t have been 
written. 


Conventions Used in This Book 


The following typographical conventions are used in this book: 


Italic 
Indicates new terms, URLs, and email addresses. 
Constant width
Used for program listings, as well as within paragraphs to refer to software pack- 
ages, programming languages, file extensions, filenames, program elements such 
as variable or function names, databases, data types, environment variables, 
statements, and keywords. 


Constant width italic 
Shows text that should be replaced with user-supplied values or by values deter- 
mined by context. 


This element signifies a tip or suggestion. 


This element signifies a general note. 


This element indicates a warning or caution. 


Using Code Examples 


Supplemental material (in particular, Jupyter Notebooks and Python scripts/ 
modules) is available for usage and download at http://py4fi.pqp.io. 


This book is here to help you get your job done. In general, if example code is offered 
with this book, you may use it in your programs and documentation. You do not 
need to contact us for permission unless you’re reproducing a significant portion of 
the code. For example, writing a program that uses several chunks of code from this 
book does not require permission. Selling or distributing a CD-ROM of examples 
from O'Reilly books does require permission. Answering a question by citing this 
book and quoting example code does not require permission. Incorporating a signifi- 
cant amount of example code from this book into your product’s documentation 
does require permission. 


We appreciate, but do not require, attribution. An attribution usually includes the 
title, author, publisher, and ISBN. For example: “Python for Finance, 2nd Edition, by 
Yves Hilpisch (O’Reilly). Copyright 2019 Yves Hilpisch, 978-1-492-02433-0.” 
If you feel your use of code examples falls outside fair use or the permission given
above, feel free to contact us at permissions@oreilly.com. 


O'Reilly Online Learning 


For more than 40 years, O’Reilly Media has provided technology and business
training, knowledge, and insight to help companies succeed.


Our unique network of experts and innovators share their knowledge and expertise 
through books, articles, and our online learning platform. O’Reilly’s online learning 
platform gives you on-demand access to live training courses, in-depth learning 
paths, interactive coding environments, and a vast collection of text and video from 
O’Reilly and 200+ other publishers. For more information, visit http://oreilly.com. 


How to Contact Us 


Please address comments and questions concerning this book to the publisher: 


O’Reilly Media, Inc. 

1005 Gravenstein Highway North 

Sebastopol, CA 95472 

800-998-9938 (in the United States or Canada) 
707-829-0515 (international or local) 
707-829-0104 (fax) 


We have a web page for this book, where we list errata, examples, and any additional 
information. You can access this page at http://bit.ly/python-finance-2e. 


To comment or ask technical questions about this book, send email to
bookquestions@oreilly.com.


For news and more information about our books and courses, see our website at
http://www.oreilly.com.


Find us on Facebook: http://facebook.com/oreilly 
Follow us on Twitter: http://twitter.com/oreillymedia 


Watch us on YouTube: http://www.youtube.com/oreillymedia 
PART I
Python and Finance 


This part introduces Python for finance. It consists of two chapters: 


• Chapter 1 briefly discusses Python in general and argues in some detail why
  Python is well suited to addressing the technological challenges in the financial
  industry as well as in financial data analytics.

• Chapter 2 is about Python infrastructure; it provides a concise overview of
  important aspects of managing a Python environment to get you started with
  interactive financial analytics and financial application development in Python.


CHAPTER 1 
Why Python for Finance 


Banks are essentially technology firms. 


—Hugo Banziger 


The Python Programming Language 


Python is a high-level, multipurpose programming language that is used in a wide 
range of domains and technical fields. On the Python website you find the following 
executive summary: 


Python is an interpreted, object-oriented, high-level programming language with 
dynamic semantics. Its high-level built in data structures, combined with dynamic typ- 
ing and dynamic binding, make it very attractive for Rapid Application Development, 
as well as for use as a scripting or glue language to connect existing components 
together. Python’s simple, easy to learn syntax emphasizes readability and therefore 
reduces the cost of program maintenance. Python supports modules and packages, 
which encourages program modularity and code reuse. The Python interpreter and the 
extensive standard library are available in source or binary form without charge for all 
major platforms, and can be freely distributed. 


This pretty well describes why Python has evolved into one of the major
programming languages today. Nowadays, Python is used by the beginner programmer as
well as by the highly skilled expert developer, at schools, in universities, at web
companies, in large corporations and financial institutions, as well as in any scientific
field.


Among other features, Python is: 


Open source 
Python and the majority of supporting libraries and tools available are open 
source and generally come with quite flexible and open licenses. 


Interpreted 
The reference CPython implementation is an interpreter of the language that 
translates Python code at runtime to executable byte code. 


Multiparadigm 
Python supports different programming and implementation paradigms, such as 
object orientation and imperative, functional, or procedural programming. 


Multipurpose 
Python can be used for rapid, interactive code development as well as for build- 
ing large applications; it can be used for low-level systems operations as well as 
for high-level analytics tasks. 


Cross-platform 
Python is available for the most important operating systems, such as Windows, 
Linux, and macOS. It is used to build desktop as well as web applications, and it 
can be used on the largest clusters and most powerful servers as well as on such 
small devices as the Raspberry Pi. 


Dynamically typed 
Types in Python are in general inferred at runtime and not statically declared as 
in most compiled languages. 


Indentation aware 
In contrast to the majority of other programming languages, Python uses inden- 
tation for marking code blocks instead of parentheses, brackets, or semicolons. 


Garbage collecting 
Python has automated garbage collection, avoiding the need for the programmer 
to manage memory. 
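
A few of the features listed above can be observed in a handful of lines. The following is a minimal sketch that is not taken from the book; the names `value`, `prices`, and `double` as well as the numbers are purely illustrative assumptions.

```python
import dis

# Dynamically typed: the same name can be bound to objects of different types.
value = 100
type_before = type(value).__name__
value = 'one hundred'
type_after = type(value).__name__

# Multiparadigm: the same sum written imperatively and in a functional style.
prices = [100.0, 101.5, 99.0]   # illustrative numbers only

total_imperative = 0.0
for p in prices:                # indentation alone marks the loop body
    total_imperative += p

total_functional = sum(map(float, prices))

# Interpreted: CPython compiles a function to byte code, which dis can list.
def double(x):
    return 2 * x

instructions = [ins.opname for ins in dis.get_instructions(double)]

print(type_before, type_after)
print(total_imperative, total_functional)
print(instructions)
```

The exact byte code instruction names printed in the last line depend on the CPython version in use.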


When it comes to Python syntax and what Python is all about, Python Enhancement 
Proposal 20—i.e., the so-called “Zen of Python”—provides the major guidelines. It 
can be accessed from every interactive shell with the command import this: 


In [1]: import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


A Brief History of Python


Although Python might still have the appeal of something new to some people, it has
been around for quite a long time. In fact, development efforts were begun in the 1980s by
Guido van Rossum from the Netherlands. He is still active in Python development
and has been awarded the title of Benevolent Dictator for Life by the Python
community. In July 2018, van Rossum stepped down from this position after decades of
being an active driver of the Python core development efforts. The following can be
considered milestones in the development of Python:


• Python 0.9.0 released in 1991 (first release)
• Python 1.0 released in 1994
• Python 2.0 released in 2000
• Python 2.6 released in 2008
• Python 3.0 released in 2008
• Python 3.1 released in 2009
• Python 2.7 released in 2010
• Python 3.2 released in 2011
• Python 3.3 released in 2012
• Python 3.4 released in 2014
• Python 3.5 released in 2015
• Python 3.6 released in 2016
• Python 3.7 released in June 2018


It is remarkable, and sometimes confusing to Python newcomers, that there are two
major versions available, still being developed and, more importantly, in parallel use
since 2008. As of this writing, this will probably continue for a while, since a large
amount of the code available and in production still targets Python 2.6/2.7. While the
first edition of this book was based on Python 2.7, this second edition uses Python 3.7
throughout.
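
To see which major version an interpreter belongs to, and one well-known behavioral difference between the two lines, a short check such as the following can help. It is only a sketch; the variable names are illustrative and not from the book.

```python
import sys

# The interpreter reports its own version at runtime.
major, minor = sys.version_info[:2]

# In Python 3, / is true division and // is floor division.
# In Python 2, 7 / 2 with two integers evaluated to 3.
true_division = 7 / 2
floor_division = 7 // 2

print(f'running Python {major}.{minor}')
print(true_division, floor_division)
```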


The Python Ecosystem


A major feature of Python as an ecosystem, compared to just being a programming 
language, is the availability of a large number of packages and tools. These packages 
and tools generally have to be imported when needed (e.g., a plotting library) or have 
to be started as a separate system process (e.g., a Python interactive development 
environment). Importing means making a package available to the current name- 
space and the current Python interpreter process. 


Python itself already comes with a large set of packages and modules that enhance the 
basic interpreter in different directions, known as the Python Standard Library. For 
example, basic mathematical calculations can be done without any importing, while 
more specialized mathematical functions need to be imported through the math mod- 
ule: 


In [2]: 100 * 2.5 + 50
Out[2]: 300.0

In [3]: log(1)  # (1)

NameError                                 Traceback (most recent call last)
<ipython-input-3-74f22a2fd43b> in <module>
----> 1 log(1)

NameError: name 'log' is not defined

In [4]: import math  # (2)

In [5]: math.log(1)  # (2)
Out[5]: 0.0

(1) Without further imports, an error is raised.

(2) After importing the math module, the calculation can be executed.
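
The import step can take several forms, each of which binds a different name in the current namespace. The following sketch uses only the standard `math` module; the alias `m` and the variable names are arbitrary choices for illustration.

```python
import math                  # binds the module name; use math.log
from math import log, sqrt   # binds selected functions directly
import math as m             # binds the module under a shorter alias

a = math.log(1)
b = log(1)
c = m.log(1)
d = sqrt(16)

# A name that was never imported or defined still raises a NameError.
try:
    exp(1)
except NameError as err:
    message = str(err)

print(a, b, c, d)
print(message)
```

Importing the module as a whole keeps the origin of each function visible at the call site, which is often preferable in longer scripts.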


While math is a standard Python module available with any Python installation, there 
are many more packages that can be installed optionally and that can be used in the 
very same fashion as the standard modules. Such packages are available from differ- 
ent (web) sources. However, it is generally advisable to use a Python package man- 
ager that makes sure that all libraries are consistent with each other (see Chapter 2 for 
more on this topic). 


The code examples presented so far use interactive Python environments: IPython
and Jupyter, respectively. These are probably the most widely used interactive Python
environments at the time of this writing. Although IPython started out as just an
enhanced interactive Python shell, it today has many features typically found in
integrated development environments (IDEs), such as support for profiling and
debugging. Those features missing in IPython are typically provided by advanced text/code
editors, like Vim, which can also be integrated with IPython. Therefore, it is not
unusual to combine IPython with one’s text/code editor of choice to form the basic
toolchain for a Python development process.


IPython enhances the standard interactive shell in many ways. Among other things, it 
provides improved command-line history functions and allows for easy object 
inspection. For instance, the help text (docstring) for a function is printed by just 
adding a ? before or after the function name (adding ?? will provide even more infor- 
mation). 
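
The ? syntax only works inside IPython and Jupyter. In a plain Python script, roughly comparable information can be retrieved with the standard `inspect` module, as the following sketch shows. The function `present_value` is a hypothetical example defined only for this illustration.

```python
import inspect
import math

# Roughly the docstring that `math.log?` would display in IPython.
log_doc = inspect.getdoc(math.log)

def present_value(cash_flow, rate, years):
    """Discount a single cash flow received after a number of years."""
    return cash_flow / (1 + rate) ** years

# Call signature and docstring of a user-defined function.
signature = str(inspect.signature(present_value))
pv_doc = inspect.getdoc(present_value)

print(log_doc)
print(signature)
print(pv_doc)
```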


IPython originally came in two popular versions: a shell version and a browser-based 
version (the Notebook). The Notebook variant proved so useful and popular that it 
evolved into an independent, language-agnostic project now called Jupyter. Given 
this background, it is no surprise that Jupyter Notebook inherits most of the benefi- 
cial features of IPython—and offers much more, for example when it comes to visual- 
ization. 


Refer to VanderPlas (2016, Chapter 1) for more details on using IPython. 


The Python User Spectrum 


Python does not only appeal to professional software developers; it is also of use for 
the casual developer as well as for domain experts and scientific developers. 


Professional software developers find in Python all they might require to efficiently 
build large applications. Almost all programming paradigms are supported; there are 
powerful development tools available; and any task can, in principle, be addressed 
with Python. These types of users typically build their own frameworks and classes, 
also work on the fundamental Python and scientific stack, and strive to make the 
most of the ecosystem. 


Scientific developers or domain experts are generally heavy users of certain packages 
and frameworks, have built their own applications that they enhance and optimize 
over time, and tailor the ecosystem to their specific needs. These groups of users also 
generally engage in longer interactive sessions, rapidly prototyping new code as well 
as exploring and visualizing their research and/or domain data sets. 


Casual programmers generally like to use Python for specific problems where they
know it has particular strengths. For example, visiting the gallery page of matplotlib,
copying a certain piece of visualization code provided there, and adjusting the code 
to their specific needs might be a beneficial use case for members of this group. 


There is also another important group of Python users: beginner programmers, i.e.,
those that are just starting to program. Nowadays, Python has become a very popular
language at universities, colleges, and even schools to introduce students to
programming.¹ A major reason for this is that its basic syntax is easy to learn and easy to
understand, even for the non-developer. In addition, it is helpful that Python
supports almost all programming styles.²


The Scientific Stack 


There is a certain set of packages that is collectively labeled the scientific stack. This 
stack comprises, among others, the following packages: 


NumPy 
NumPy provides a multidimensional array object to store homogeneous or hetero- 
geneous data; it also provides optimized functions/methods to operate on this 
array object. 


SciPy 
SciPy is a collection of subpackages and functions implementing important stan- 
dard functionality often needed in science or finance; for example, one finds 
functions for cubic splines interpolation as well as for numerical integration. 


matplotlib
This is the most popular plotting and visualization package for Python, provid- 
ing both 2D and 3D visualization capabilities. 


pandas 
pandas builds on NumPy and provides richer classes for the management and 
analysis of time series and tabular data; it is tightly integrated with matplotlib 
for plotting and PyTables for data storage and retrieval. 


scikit-learn 
scikit-learn is a popular machine learning (ML) package that provides a uni- 
fied application programming interface (API) for many different ML algorithms, 
such as for estimation, classification, or clustering. 


PyTables 
PyTables is a popular wrapper for the HDF5 data storage package; it is a package 
to implement optimized, disk-based I/O operations based on a hierarchical data- 
base/file format. 


1 Python, for example, is a major language used in the Master of Financial Engineering Program at Baruch Col- 
lege of the City University of New York. The first edition of this book is in use at a large number of universi- 
ties around the world to teach Python for financial analysis and application building. 


2 See http://wiki.python.org/moin/BeginnersGuide, where you will find links to many valuable resources for 
both developers and non-developers getting started with Python. 
Depending on the specific domain or problem, this stack is enlarged by additional
packages, which more often than not have in common that they build on top of one 
or more of these fundamental packages. However, the least common denominator or 
basic building blocks in general are the NumPy ndarray class (see Chapter 4) and the 
pandas DataFrame class (see Chapter 5). 
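
To give a first impression of these two building blocks, the following sketch computes log returns once with a NumPy ndarray object and once with a pandas DataFrame object. It assumes that both packages are installed; the price series and the dates are made up for illustration and do not come from the book.

```python
import numpy as np
import pandas as pd

# NumPy ndarray: numerical data with vectorized operations.
prices = np.array([100.0, 102.0, 101.0, 105.0])   # made-up price series
log_returns = np.log(prices[1:] / prices[:-1])

# pandas DataFrame: labeled, tabular data built on top of NumPy arrays.
df = pd.DataFrame({'price': prices},
                  index=pd.date_range('2019-01-01', periods=4, freq='D'))
df['log_return'] = np.log(df['price'] / df['price'].shift(1))

print(log_returns.round(4))
print(df.round(4))
```

The DataFrame version keeps the dates attached to each value, and the first return is missing (NaN) because there is no previous price to compare with.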


Taking Python as a programming language alone, there are a number of other lan- 
guages available that can probably keep up with its syntax and elegance. For example, 
Ruby is a popular language often compared to Python. The language’s website 
describes Ruby as: 


A dynamic, open source programming language with a focus on simplicity and pro- 
ductivity. It has an elegant syntax that is natural to read and easy to write. 


The majority of people using Python would probably also agree with the exact same 
statement being made about Python itself. However, what distinguishes Python for 
many users from equally appealing languages like Ruby is the availability of the scien- 
tific stack. This makes Python not only a good and elegant language to use, but also 
one that is capable of replacing domain-specific languages and tool sets like Matlab 
or R. It also provides by default anything that you would expect, say, as a seasoned 
web developer or systems administrator. In addition, Python is good at interfacing 
with domain-specific languages such as R, so that the decision usually is not about 
either Python or something else—it is rather about which language should be the 
major one. 


Technology in Finance 


With these “rough ideas” of what Python is all about, it makes sense to step back a bit 
and to briefly contemplate the role of technology in finance. This will put one in a 
position to better judge the role Python already plays and, even more importantly, 
will probably play in the financial industry of the future. 


In a sense, technology per se is nothing special to financial institutions (as compared, 
for instance, to biotechnology companies) or to the finance function (as compared to 
other corporate functions, like logistics). However, in recent years, spurred by inno- 
vation and also regulation, banks and other financial institutions like hedge funds 
have evolved more and more into technology companies instead of being just finan- 
cial intermediaries. Technology has become a major asset for almost any financial 
institution around the globe, having the potential to lead to competitive advantages 
as well as disadvantages. Some background information can shed light on the reasons 
for this development. 
Technology Spending


Banks and financial institutions together form the industry that spends the most on 
technology on an annual basis. The following statement therefore shows not only that 
technology is important for the financial industry, but that the financial industry is 
also really important to the technology sector: 


FRAMINGHAM, Mass., June 14, 2018 - Worldwide spending on information tech- 
nology (IT) by financial services firms will be nearly $500 billion in 2021, growing 
from $440 billion in 2018, according to new data from a series of Financial Services IT 
Spending Guides from International Data Corporation (IDC). 


—IDC 
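
The two figures in the statement imply an annual growth rate that is easy to back out. The following back-of-the-envelope sketch treats "nearly $500 billion" as exactly 500, which is an assumption made only for the calculation.

```python
# Figures from the IDC statement, in billions of USD.
spending_2018 = 440.0
spending_2021 = 500.0   # assumption: "nearly $500 billion" rounded to 500
years = 2021 - 2018

# Compound annual growth rate implied by the two data points.
cagr = (spending_2021 / spending_2018) ** (1 / years) - 1

print(f'implied annual growth: {cagr:.2%}')
```

Under this assumption, the implied growth rate is a little over 4 percent per year.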


In particular, banks and other financial institutions are engaging in a race to make 
their business and operating models digital: 


Bank spending on new technologies was predicted to amount to 19.9 billion U.S.
dollars in 2017 in North America.

The banks develop current systems and work on new technological solutions in order 
to increase their competitiveness on the global market and to attract clients interested 
in new online and mobile technologies. It is a big opportunity for global fintech com- 
panies which provide new ideas and software solutions for the banking industry. 


—Statista 


Large multinational banks today generally employ thousands of developers to main- 
tain existing systems and build new ones. Large investment banks with heavy techno- 
logical requirements often have technology budgets of several billion USD per year. 


Technology as Enabler 


The technological development has also contributed to innovations and efficiency 
improvements in the financial sector. Typically, projects in this area run under the 
umbrella of digitalization. 


The financial services industry has seen drastic technology-led changes over the past 
few years. Many executives look to their IT departments to improve efficiency and 
facilitate game-changing innovation—while somehow also lowering costs and con- 
tinuing to support legacy systems. Meanwhile, FinTech start-ups are encroaching 
upon established markets, leading with customer-friendly solutions developed from 
the ground up and unencumbered by legacy systems. 


—PwC 19th Annual Global CEO Survey 2016 


As a side effect of the increasing efficiency, competitive advantages must often be 
looked for in ever more complex products or transactions. This in turn inherently 
increases risks and makes risk management as well as oversight and regulation more 
and more difficult. The financial crisis of 2007 and 2008 tells the story of potential 
dangers resulting from such developments. In a similar vein, “algorithms and
computers gone wild” represent a potential risk to the financial markets; this materialized
dramatically in the so-called flash crash of May 2010, where automated selling led to 
large intraday drops in certain stocks and stock indices. Part IV covers topics related 
to the algorithmic trading of financial instruments. 


Technology and Talent as Barriers to Entry 


On the one hand, technology advances reduce cost over time, ceteris paribus. On the 
other hand, financial institutions continue to invest heavily in technology to both 
gain market share and defend their current positions. To be active today in certain 
areas in finance often brings with it the need for large-scale investments in both tech- 
nology and skilled staff. As an example, consider the derivatives analytics space: 


Aggregated over the total software lifecycle, firms adopting in-house strategies for 
OTC [derivatives] pricing will require investments between $25 million and $36 mil- 
lion alone to build, maintain, and enhance a complete derivatives library. 


—Ding (2010) 


Not only is it costly and time-consuming to build a full-fledged derivatives analytics 
library, but you also need to have enough experts to do so. And these experts have to 
have the right tools and technologies available to accomplish their tasks. With the 
development of the Python ecosystem, such efforts have become more efficient and 
budgets in this regard can be reduced significantly today compared to, say, 10 years 
ago. Part V covers derivatives analytics and builds a small but powerful and flexible 
derivatives pricing library with Python and standard Python packages alone. 


Another quote about the early days of Long-Term Capital Management (LTCM), for- 
merly one of the most respected quantitative hedge funds—which, however, went 
bust in the late 1990s—further supports this insight about technology and talent: 


Meriwether spent $20 million on a state-of-the-art computer system and hired a crack 
team of financial engineers to run the show at LTCM, which set up shop in Greenwich, 
Connecticut. It was risk management on an industrial level. 


—Patterson (2010) 


The same computing power that Meriwether had to buy for millions of dollars is 
today probably available for thousands or can be rented from a cloud provider based 
on a flexible fee plan. Chapter 2 shows how to set up an infrastructure in the cloud 
for interactive financial analytics, application development, and deployment with 
Python. The budgets for such a professional infrastructure start at a few USD per 
month. On the other hand, trading, pricing, and risk management have become so 
complex for larger financial institutions that today they need to deploy IT infrastructures 
with tens of thousands of computing cores. 


Ever-Increasing Speeds, Frequencies, and Data Volumes 


The one dimension of the finance industry that has been influenced most by techno- 
logical advances is the speed and frequency with which financial transactions are 
decided and executed. Lewis (2014) describes so-called flash trading—i.e., trading at 
the highest speeds possible—in vivid detail. 


On the one hand, increasing data availability on ever-smaller time scales makes it 
necessary to react in real time. On the other hand, the increasing speed and frequency 
of trading makes the data volumes further increase. This leads to processes that rein- 
force each other and push the average time scale for financial transactions systemati- 
cally down. This is a trend that had already started a decade ago: 


Renaissance’s Medallion fund gained an astonishing 80 percent in 2008, capitalizing 
on the market’s extreme volatility with its lightning-fast computers. Jim Simons was 
the hedge fund world’s top earner for the year, pocketing a cool $2.5 billion. 


—Patterson (2010) 


Thirty years’ worth of daily stock price data for a single stock represents roughly 
7,500 closing quotes. This kind of data is what most of today’s finance theory is based 
on. For example, modern or mean-variance portfolio theory (MPT), the capital asset 
pricing model (CAPM), and value-at-risk (VaR) all have their foundations in daily 
stock price data. 


In comparison, on a typical trading day during a single trading hour the stock price 
of Apple Inc. (AAPL) may be quoted around 15,000 times—roughly twice the number 
of quotes compared to available end-of-day closing quotes over 30 years (see the 
example in “Data-Driven and AI-First Finance” on page 24). This brings with it a 
number of challenges: 


Data processing 
It does not suffice to consider and process end-of-day quotes for stocks or other 
financial instruments; “too much” happens during the day, and for some instruments 
trading continues 24 hours a day, 7 days a week. 


Analytics speed 
Decisions often have to be made in milliseconds or even faster, making it neces- 
sary to build the respective analytics capabilities and to analyze large amounts of 
data in real time. 


Theoretical foundations 
Although traditional finance theories and concepts are far from perfect, 
they have been well tested (and sometimes well rejected) over time; for the 
millisecond and microsecond scales that matter today, consistent financial 
concepts and theories in the traditional sense that have proven to be somewhat 
robust over time are still missing. 


All these challenges can in general only be addressed by modern technology. What 
might also be a little surprising is that the lack of consistent theories 
is often addressed by technological approaches, in that high-speed algorithms exploit 
market microstructure elements (e.g., order flow, bid-ask spreads) rather than relying 
on some kind of financial reasoning. 


The Rise of Real-Time Analytics 


There is one discipline that has seen a strong increase in importance in the finance 
industry: financial and data analytics. This phenomenon has a close relationship to 
the insight that speeds, frequencies, and data volumes increase at a rapid pace in the 
industry. In fact, real-time analytics can be considered the industry’s answer to this 
trend. 


Roughly speaking, “financial and data analytics” refers to the discipline of applying 
software and technology in combination with (possibly advanced) algorithms and 
methods to gather, process, and analyze data in order to gain insights, to make deci- 
sions, or to fulfill regulatory requirements, for instance. Examples might include the 
estimation of sales impacts induced by a change in the pricing structure for a finan- 
cial product in the retail branch of a bank, or the large-scale overnight calculation of 
credit valuation adjustments (CVA) for complex portfolios of derivatives trades of an 
investment bank. 


There are two major challenges that financial institutions face in this context: 


Big data 
Banks and other financial institutions had to deal with massive amounts of data 
even before the term “big data” was coined; however, the amount of data that has 
to be processed during single analytics tasks has increased tremendously over 
time, demanding both increased computing power and ever-larger memory and 
storage capacities. 


Real-time economy 
In the past, decision makers could rely on structured, regular planning as well as 
decision and (risk) management processes, whereas today they face the need to 
take care of these functions in real time; several tasks that were taken care of 
in the past via overnight batch runs in the back office have now been moved to 
the front office and are executed in real time. 


Again, one can observe an interplay between advances in technology and financial/ 
business practice. On the one hand, there is the need to constantly improve analytics 
approaches in terms of speed and capability by applying modern technologies. On 
the other hand, advances on the technology side allow new analytics approaches that 
were considered impossible (or infeasible due to budget constraints) a couple of years 
or even months ago. 


One major trend in the analytics space has been the utilization of parallel architectures 
on the central processing unit (CPU) side and massively parallel architectures 
on the general-purpose graphics processing unit (GPGPU) side. Current GPGPUs 
have computing cores in the thousands, making necessary a sometimes radical 
rethinking of what parallelism might mean to different algorithms. What is still an 
obstacle in this regard is that users generally have to learn new programming paradigms 
and techniques to harness the power of such hardware. 


Python for Finance 


The previous section described selected aspects characterizing the role of technology 
in finance: 


• Costs for technology in the finance industry 
• Technology as an enabler for new business and innovation 
• Technology and talent as barriers to entry in the finance industry 
• Increasing speeds, frequencies, and data volumes 
• The rise of real-time analytics 


This section analyzes how Python can help in addressing several of the challenges 
these imply. But first, on a more fundamental level, it offers a brief analysis of 
Python for finance from a language and syntax point of view. 


Finance and Python Syntax 


Most people who make their first steps with Python in a finance context may attack 
an algorithmic problem. This is similar to a scientist who, for example, wants to solve 
a differential equation, evaluate an integral, or simply visualize some data. In general, 
at this stage, little thought is given to topics like a formal development process, test- 
ing, documentation, or deployment. However, this especially seems to be the stage 
where people fall in love with Python. A major reason for this might be that Python 
syntax is generally quite close to the mathematical syntax used to describe scientific 
problems or financial algorithms. 


This can be illustrated by a financial algorithm, namely the valuation of a European 
call option by Monte Carlo simulation. The example considers a Black-Scholes- 
Merton (BSM) setup in which the option’s underlying risk factor follows a geometric 
Brownian motion. 


Assume the following numerical parameter values for the valuation: 


• Initial stock index level S_0 = 100 
• Strike price of the European call option K = 105 
• Time to maturity T = 1 year 
• Constant, riskless short rate r = 0.05 
• Constant volatility σ = 0.2 


In the BSM model, the index level at maturity is a random variable given by Equation 
1-1, with z being a standard normally distributed random variable. 


Equation 1-1. Black-Scholes-Merton (1973) index level at maturity 


$$S_T = S_0 \exp\left(\left(r - \frac{1}{2}\sigma^2\right) T + \sigma \sqrt{T}\, z\right)$$


The following is an algorithmic description of the Monte Carlo valuation procedure: 


1. Draw I pseudo-random numbers z(i), i ∈ {1, 2, ..., I}, from the standard normal 
   distribution. 

2. Calculate all resulting index levels at maturity S_T(i) for given z(i) and Equation 
   1-1. 

3. Calculate all inner values of the option at maturity as h_T(i) = max(S_T(i) - K, 0). 

4. Estimate the option present value via the Monte Carlo estimator as given in 
   Equation 1-2. 


Equation 1-2. Monte Carlo estimator for European option 


$$C_0 \approx e^{-rT} \frac{1}{I} \sum_{i=1}^{I} h_T(i)$$


This problem and algorithm must now be translated into Python. The following code 
implements the required steps: 


In [6]: import math 
        import numpy as np  # (1) 

In [7]: S0 = 100.  # (2) 
        K = 105.  # (2) 
        T = 1.0  # (2) 
        r = 0.05  # (2) 
        sigma = 0.2  # (2) 

In [8]: I = 100000  # (2) 

In [9]: np.random.seed(1000)  # (3) 

In [10]: z = np.random.standard_normal(I)  # (4) 

In [11]: ST = S0 * np.exp((r - sigma ** 2 / 2) * T + sigma * math.sqrt(T) * z)  # (5) 

In [12]: hT = np.maximum(ST - K, 0)  # (6) 

In [13]: C0 = math.exp(-r * T) * np.mean(hT)  # (7) 

In [14]: print('Value of the European call option: {:5.3f}.'.format(C0))  # (8) 
         Value of the European call option: 8.019. 


(1) NumPy is used here as the main package. 

(2) The model and simulation parameter values are defined. 

(3) The seed value for the random number generator is fixed. 

(4) Standard normally distributed random numbers are drawn. 

(5) End-of-period values are simulated. 

(6) The option payoffs at maturity are calculated. 

(7) The Monte Carlo estimator is evaluated. 

(8) The resulting value estimate is printed. 


Three aspects are worth highlighting: 


Syntax 
The Python syntax is indeed quite close to the mathematical syntax, e.g., when it 
comes to the parameter value assignments. 


Translation 
Every mathematical and/or algorithmic statement can generally be translated 
into a single line of Python code. 


Vectorization 
One of the strengths of NumPy is the compact, vectorized syntax, e.g., allowing for 
100,000 calculations within a single line of code. 
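

As a plausibility check on the simulated value, one can compare it with the closed-form BSM price of a European call, which exists for this setup. The following sketch is not part of the original example: the function names norm_cdf and bsm_call_value are chosen here for illustration, and the normal distribution function is built from math.erf to avoid further dependencies. With the parameter values above, it gives a value of about 8.021, close to the Monte Carlo estimate of 8.019.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bsm_call_value(S0, K, T, r, sigma):
    """Closed-form BSM value of a European call option (no dividends)."""
    d1 = (math.log(S0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S0 * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)


# same parameter values as in the Monte Carlo example
C0_exact = bsm_call_value(S0=100., K=105., T=1.0, r=0.05, sigma=0.2)
print('Closed-form value of the European call option: {:5.3f}.'.format(C0_exact))
```

The small remaining difference between the two numbers is the simulation error, which shrinks as the number of simulated paths I grows.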


This code can be used in an interactive environment like IPython or Jupyter Note- 
book. However, code that is meant to be reused regularly typically gets organized in 
so-called modules (or scripts), which are single Python files (technically text files) 
with the suffix .py. Such a module could in this case look like Example 1-1 and could 
be saved as a file named bsm_mcs_euro.py. 


Example 1-1. Monte Carlo valuation of European call option 


# 
# Monte Carlo valuation of European call option 
# in Black-Scholes-Merton model 
# bsm_mcs_euro.py 
# 
# Python for Finance, 2nd ed. 
# (c) Dr. Yves J. Hilpisch 
# 
import math 
import numpy as np 

# Parameter Values 
S0 = 100.  # initial index level 
K = 105.  # strike price 
T = 1.0  # time-to-maturity 
r = 0.05  # riskless short rate 
sigma = 0.2  # volatility 

I = 100000  # number of simulations 

# Valuation Algorithm 
z = np.random.standard_normal(I)  # pseudo-random numbers 
# index values at maturity 
ST = S0 * np.exp((r - 0.5 * sigma ** 2) * T + sigma * math.sqrt(T) * z) 
hT = np.maximum(ST - K, 0)  # payoff at maturity 
C0 = math.exp(-r * T) * np.mean(hT)  # Monte Carlo estimator 

# Result Output 
print('Value of the European call option %5.3f.' % C0) 


The algorithmic example in this subsection illustrates that Python, with its very syntax, 
is well suited to complement the classic duo of scientific languages, English and 
mathematics. It seems that adding Python to the set of scientific languages makes it 
more well rounded. One then has: 


• English for writing and talking about scientific and financial problems, etc. 
• Mathematics for concisely, exactly describing and modeling abstract aspects, 
  algorithms, complex quantities, etc. 
• Python for technically modeling and implementing abstract aspects, algorithms, 
  complex quantities, etc. 


Mathematics and Python Syntax 


There is hardly any programming language that comes as close to 
mathematical syntax as Python. Numerical algorithms are there- 
fore in general straightforward to translate from the mathematical 
representation into the Pythonic implementation. This makes pro- 
totyping, development, and code maintenance in finance quite effi- 
cient with Python. 


In some areas, it is common practice to use pseudo-code and thereby introduce a 
fourth language family member. The role of pseudo-code is to represent, for example, 
financial algorithms in a more technical fashion that is both still close to the 
mathematical representation and already quite close to the technical implementation. In 
addition to the algorithm itself, pseudo-code takes into account how computers work 
in principle. 


This practice generally has its cause in the fact that with most (compiled) program- 
ming languages the technical implementation is quite “far away” from its formal, 
mathematical representation. The majority of programming languages make it neces- 
sary to include so many elements that are only technically required that it is hard to 
see the equivalence between the mathematics and the code. 


Nowadays, Python is often used in a pseudo-code way since its syntax is almost analo- 
gous to the mathematics and since the technical “overhead” is kept to a minimum. 
This is accomplished by a number of high-level concepts embodied in the language 
that not only have their advantages but also come in general with risks and/or other 
costs. However, it is safe to say that with Python you can, whenever the need arises, 
follow the same strict implementation and coding practices that other languages 
might require from the outset. In that sense, Python can provide the best of both 
worlds: high-level abstraction and rigorous implementation. 


Efficiency and Productivity Through Python 


At a high level, benefits from using Python can be measured in three dimensions: 


Efficiency 
How can Python help in getting results faster, in saving costs, and in saving time? 


Productivity 
How can Python help in getting more done with the same resources (people, 
assets, etc.)? 


Quality 
What does Python allow one to do that alternative technologies do not allow for? 


A discussion of these aspects can by nature not be exhaustive. However, it can highlight 
some arguments as a starting point. 


Shorter time-to-results 


A field where the efficiency of Python becomes quite obvious is interactive data analytics. 
This is a field that benefits tremendously from such powerful tools as IPython, 
Jupyter Notebook, and packages like pandas. 


Consider a finance student who is writing their master’s thesis and is interested in 
S&P 500 index values. They want to analyze historical index levels for, say, a few 
years to see how the volatility of the index has fluctuated over time and hope to find 
evidence that volatility, in contrast to some typical model assumptions, fluctuates 
over time and is far from being constant. The results should also be visualized. The 
student mainly has to do the following: 


• Retrieve index level data from the web 
• Calculate the annualized rolling standard deviation of the log returns (volatility) 
• Plot the index level data and the volatility results 


These tasks are complex enough that not too long ago one would have considered 
them to be something for professional financial analysts only. Today, even the 
finance student can easily cope with such problems. The following code shows how 
exactly this works—without worrying about syntax details at this stage (everything is 
explained in detail in subsequent chapters): 


In [16]: import numpy as np  # (1) 
         import pandas as pd 
         from pylab import plt, mpl  # (2) 

In [17]: plt.style.use('seaborn')  # (2) 
         mpl.rcParams['font.family'] = 'serif'  # (2) 
         %matplotlib inline 

In [18]: data = pd.read_csv('../../source/tr_eikon_eod_data.csv', 
                            index_col=0, parse_dates=True)  # (3) 
         data = pd.DataFrame(data['.SPX'])  # (4) 
         data.dropna(inplace=True) 
         data.info()  # (5) 
         <class 'pandas.core.frame.DataFrame'> 
         DatetimeIndex: 2138 entries, 2010-01-04 to 2018-06-29 
         Data columns (total 1 columns): 
         .SPX    2138 non-null float64 
         dtypes: float64(1) 
         memory usage: 33.4 KB 

In [19]: data['rets'] = np.log(data / data.shift(1))  # (6) 
         data['vola'] = data['rets'].rolling(252).std() * np.sqrt(252)  # (7) 

In [20]: data[['.SPX', 'vola']].plot(subplots=True, figsize=(10, 6));  # (8) 


(1) This imports NumPy and pandas. 

(2) This imports matplotlib and configures the plotting style and approach for 
    Jupyter. 

(3) pd.read_csv() allows the retrieval of remotely or locally stored data sets in 
    comma-separated values (CSV) form. 

(4) A subset of the data is picked and NaN (“not a number”) values eliminated. 

(5) This shows some metainformation about the data set. 

(6) The log returns are calculated in vectorized fashion (“no looping” on the Python 
    level). 

(7) The rolling, annualized volatility is derived. 

(8) This finally plots the two time series. 


Figure 1-1 shows the graphical result of this brief interactive session. It can be consid- 
ered almost amazing that a few lines of code suffice to implement three rather com- 
plex tasks typically encountered in financial analytics: data gathering, complex and 
repeated mathematical calculations, as well as visualization of the results. The exam- 
ple illustrates that pandas makes working with whole time series almost as simple as 
doing mathematical operations on floating-point numbers. 
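

Because the CSV file used in the session is not generally available, the same two calculation steps can be tried on synthetic data. The following is a minimal sketch under stated assumptions: the price path is simulated as a geometric Brownian motion with an assumed constant volatility of 20% and drift of 5%, and the names prices, n_days, and the column label 'price' are hypothetical, not taken from the original data set.

```python
import numpy as np
import pandas as pd

# assumed parameters for the synthetic price path (not from the original data)
n_days = 1000       # number of business days to simulate
sigma = 0.2         # assumed constant annualized volatility
mu = 0.05           # assumed annualized drift
dt = 1 / 252        # one trading day as a year fraction

rng = np.random.default_rng(1000)
z = rng.standard_normal(n_days)
# daily log returns of a geometric Brownian motion
log_rets = (mu - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * z

index = pd.bdate_range('2010-01-04', periods=n_days)
prices = pd.DataFrame({'price': 100 * np.exp(np.cumsum(log_rets))}, index=index)

# the same two vectorized steps as in the session: log returns, then the
# rolling standard deviation over 252 days scaled to an annual figure
prices['rets'] = np.log(prices['price'] / prices['price'].shift(1))
prices['vola'] = prices['rets'].rolling(252).std() * np.sqrt(252)

print(prices['vola'].dropna().describe())
```

On such simulated data the rolling estimate stays close to the assumed 20%, since the volatility is constant by construction. The first 252 values of the vola column are NaN because a full window of returns is needed before the first estimate can be computed.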


Translated to a professional finance context, the example implies that financial analysts 
can, when applying the right Python tools and packages that provide high-level 
abstractions, focus on their domain and not on the technical intricacies. Analysts 
can also react faster, providing valuable insights almost in real time and making 
sure they are one step ahead of the competition. This example of increased efficiency 
can easily translate into measurable bottom-line effects. 


Figure 1-1. S&P 500 closing values and annualized volatility 


Ensuring high performance 


In general, it is accepted that Python has a rather concise syntax and that it is relatively 
efficient to code with. However, due to the very nature of Python being an 
interpreted language, the prejudice persists that Python often is too slow for 
compute-intensive tasks in finance. Indeed, depending on the specific implementation 
approach, Python can be really slow. But it does not have to be slow; it can be 
highly performing in almost any application area. In principle, one can distinguish at 
least three different strategies for better performance: 


Idioms and paradigms 
In general, many different ways can lead to the same result in Python, but sometimes 
with rather different performance characteristics; “simply” choosing the 
right way (e.g., a specific implementation approach, such as the judicious use of 
data structures, avoiding loops through vectorization, or the use of a specific 
package such as pandas) can improve results significantly. 


Compiling 
Nowadays, there are several performance packages available that provide compiled 
versions of important functions or that compile Python code statically or 
dynamically (at runtime or call time) to machine code, which can make such 
functions orders of magnitude faster than pure Python code; popular ones are 
Cython and Numba. 


Parallelization 
Many computational tasks, in particular in finance, can significantly benefit from 
parallel execution; this is nothing special to Python but something that can easily 
be accomplished with it. 


Performance Computing with Python 


Python per se is not a high-performance computing technology. 
However, Python has developed into an ideal platform to access 
current performance technologies. In that sense, Python has 
become something like a glue language for performance computing 
technologies. 


This subsection sticks to a simple, but still realistic, example that touches upon all 
three strategies (later chapters illustrate the strategies in detail). A quite common task 
in financial analytics is to evaluate complex mathematical expressions on large arrays 
of numbers. To this end, Python itself provides everything needed: 


In [21]: import math 
         loops = 2500000 
         a = range(1, loops) 
         def f(x): 
             return 3 * math.log(x) + math.cos(x) ** 2 
         %timeit r = [f(x) for x in a] 
         1.59 s ± 41.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 


The Python interpreter needs about 1.6 seconds in this case to evaluate the function 
f() 2,500,000 times. The same task can be implemented using NumPy, which provides 
optimized (i.e., precompiled) functions to handle such array-based operations: 


In [22]: import numpy as np 
         a = np.arange(1, loops) 
         %timeit r = 3 * np.log(a) + np.cos(a) ** 2 
         87.9 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 


Using NumPy considerably reduces the execution time to about 88 milliseconds. How- 
ever, there is even a package specifically dedicated to this kind of task. It is called 
numexpr, for “numerical expressions.” It compiles the expression to improve upon the 
performance of the general NumPy functionality by, for example, avoiding in-memory 
copies of ndarray objects along the way: 


In [23]: import numexpr as ne 
         ne.set_num_threads(1) 
         f = '3 * log(a) + cos(a) ** 2' 
         %timeit r = ne.evaluate(f) 
         50.6 ms ± 4.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 


Using this more specialized approach further reduces execution time to about 50 
milliseconds. However, numexpr also has built-in capabilities to parallelize the execution 
of the respective operation. This allows us to use multiple threads of a CPU: 


In [24]: ne.set_num_threads(4) 
         %timeit r = ne.evaluate(f) 
         22.8 ms ± 1.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 


Parallelization brings execution time further down to below 23 milliseconds in this 
case, with four threads utilized. Overall, this is a performance improvement of roughly 
70 times (1.59 seconds versus 22.8 milliseconds). Note, in particular, that this kind of 
improvement is possible without altering the basic problem/algorithm and without 
knowing any detail about compiling or parallelization approaches. The capabilities are 
accessible from a high level even by non-experts. However, one has to be aware, of 
course, of which capabilities and options exist. 
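

The first two steps of this comparison can also be reproduced outside IPython with the standard library's timeit module. The sketch below uses a smaller array than the session (an assumption made here to keep the run short), checks that both approaches give the same numbers, and reports the measured speedup. The timings depend on the machine and will differ from those quoted in the text.

```python
import math
import timeit

import numpy as np

loops = 250000  # smaller than in the session, to keep the run short


def f(x):
    return 3 * math.log(x) + math.cos(x) ** 2


def pure_python():
    # one function call per element at the Python level
    return [f(x) for x in range(1, loops)]


def vectorized():
    # the same expression evaluated on the whole array by NumPy
    a = np.arange(1, loops)
    return 3 * np.log(a) + np.cos(a) ** 2


# both approaches should agree up to floating-point rounding
same = np.allclose(pure_python(), vectorized())

# best of three single runs for each approach
t_python = min(timeit.repeat(pure_python, number=1, repeat=3))
t_numpy = min(timeit.repeat(vectorized, number=1, repeat=3))

print('results agree: {}'.format(same))
print('pure Python: {:.1f} ms'.format(t_python * 1e3))
print('NumPy:       {:.1f} ms'.format(t_numpy * 1e3))
print('speedup:     {:.1f}x'.format(t_python / t_numpy))
```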


This example shows that Python provides a number of options to make more out of 
existing resources, i.e., to increase productivity. With the parallel approach, more 
than twice as many calculations can be accomplished in the same amount of time as 
compared to the sequential approach (22.8 versus 50.6 milliseconds), in this case 
simply by telling Python to use multiple available CPU threads instead of just one. 


From Prototyping to Production 


Efficiency in interactive analytics and performance when it comes to execution speed 
are certainly two benefits of Python to consider. Yet another major benefit of using 
Python for finance might at first sight seem a bit subtler; at second sight, it might 
present itself as an important strategic factor for financial institutions. It is the possi- 
bility to use Python end-to-end, from prototyping to production. 


Today’s practice in financial institutions around the globe, when it comes to financial 
development processes, is still often characterized by a separated, two-step process. 
On the one hand, there are the quantitative analysts (“quants”) responsible for model 
development and technical prototyping. They like to use tools and environments like 
Matlab and R that allow for rapid, interactive application development. At this stage 
of the development efforts, issues like performance, stability, deployment, access 
management, and version control, among others, are not that important. One is 
mainly looking for a proof of concept and/or a prototype that exhibits the main 
desired features of an algorithm or a whole application. 


Once the prototype is finished, IT departments with their developers take over and 
are responsible for translating the existing prototype code into reliable, maintainable, 
and performant production code. Typically, at this stage there is a paradigm shift in 
that compiled languages, such as C++ or Java, are used to fulfill the requirements for 
deployment and production. Also, a formal development process with professional 
tools, version control, etc., is generally applied. 


This two-step approach has a number of generally unintended consequences: 


Inefficiencies 
Prototype code is not reusable; algorithms have to be implemented twice; redun- 
dant efforts take time and resources; risks arise during translation 


Diverse skill sets 
Different departments show different skill sets and use different languages to 
implement “the same things”; people not only program but also speak different 
languages 


Legacy code 
Code is available and has to be maintained in different languages, often using dif- 
ferent styles of implementation 


Using Python, on the other hand, enables a streamlined end-to-end process from the 
first interactive prototyping steps to highly reliable and efficiently maintainable pro- 
duction code. The communication between different departments becomes easier. 
The training of the workforce is also more streamlined in that there is only one major 
language covering all areas of financial application building. It also avoids the inher- 
ent inefficiencies and redundancies when using different technologies in different 
steps of the development process. All in all, Python can provide a consistent techno- 
logical framework for almost all tasks in financial analytics, financial application 
development, and algorithm implementation. 


Data-Driven and AI-First Finance 


Basically all the observations regarding the relationship of technology and the finan- 
cial industry first formulated in 2014 for the first edition of this book still seem pretty 
current and important in August 2018, at the time of updating this chapter for the 
second edition of the book. However, this section comments on two major trends in 
the financial industry that are about to reshape it in a fundamental way. These two 
trends have mainly crystallized over the last few years. 


Data-Driven Finance 


Some of the most important financial theories, such as MPT and CAPM, date as far 
back as the 1950s and 1960s. However, they still represent a cornerstone in the 
education of students in such fields as economics, finance, financial engineering, and 
business administration. This might be surprising since the empirical support for 
most of these theories is meager at best, and the evidence is often in complete con- 
trast to what the theories suggest and imply. On the other hand, their popularity is 
understandable since they are close to humans’ expectations of how financial markets 
might behave and since they are elegant mathematical theories resting on a number 
of appealing, if in general too simplistic, assumptions. 


The scientific method, say in physics, starts with data, for example from experiments 
or observations, and moves on to hypotheses and theories that are then tested against 
the data. If the tests are positive, the hypotheses and theories might be refined and 
properly written down, for instance, in the form of a research paper for publication. 
If the tests are negative, the hypotheses and theories are rejected and the search 
begins anew for ones that conform with the data. Since physical laws are stable over 
time, once such a law is discovered and well tested it is generally there to stay, in the 
best case, forever. 


The history of (quantitative) finance in large parts contradicts the scientific method. 
In many cases, theories and models have been developed “from scratch” on the basis 
of simplifying mathematical assumptions with the goal of discovering elegant 
answers to central problems in finance. Among others, popular assumptions in 
finance are normally distributed returns for financial instruments and linear relation- 
ships between quantities of interest. Since these phenomena are hardly ever found in 
financial markets, it should not come as a surprise that empirical evidence for the ele- 
gant theories is often lacking. Many financial theories and models have been formu- 
lated, proven, and published first and have only later been tested empirically. To 
some extent, this is of course due to the fact that financial data back in the 1950s to 
the 1970s or even later was not available in the form that it is today even to students 
getting started with a bachelor’s in finance. 


The availability of such data to financial institutions has drastically increased since 
the early to mid-1990s, and nowadays even individuals doing financial research or 
getting involved in algorithmic trading have access to huge amounts of historical data 
down to the tick level as well as real-time tick data via streaming services. This allows 
us to return to the scientific method, which starts in general with the data before 
ideas, hypotheses, models, and strategies are devised. 


A brief example shall illustrate how straightforward it has become today to retrieve 
professional data on a large scale even on a local machine, making use of Python and 
a professional data subscription to the Eikon Data APIs. The following example 
retrieves tick data for the Apple Inc. stock for one hour during a regular trading day. 
About 15,000 tick quotes, including volume information, are retrieved. While the 
symbol for the stock is AAPL, the Reuters Instrument Code (RIC) is AAPL.O: 


In [26]: import eikon as ek  # (1) 

In [27]: data = ek.get_timeseries('AAPL.O', fields='*', 
                                  start_date='2018-10-18 16:00:00', 
                                  end_date='2018-10-18 17:00:00', 
                                  interval='tick')  # (2) 

In [28]: data.info()  # (2) 
         <class 'pandas.core.frame.DataFrame'> 
         DatetimeIndex: 35350 entries, 2018-10-18 16:00:00.002000 to 2018-10-18 
          16:59:59.888000 
         Data columns (total 2 columns): 
         VALUE     35285 non-null float64 
         VOLUME    35350 non-null float64 
         dtypes: float64(2) 
         memory usage: 828.5 KB 

In [29]: data.tail()  # (3) 
Out[29]: AAPL.O                    VALUE  VOLUME 
         Date 
         2018-10-18 16:59:59.433  217.13    10.0 
         2018-10-18 16:59:59.433  217.13    12.0 
         2018-10-18 16:59:59.439  217.13   231.0 
         2018-10-18 16:59:59.754  217.14   100.0 
         2018-10-18 16:59:59.888  217.13   100.0 


(1) Eikon Data API usage requires a subscription and an API connection. 

(2) Retrieves the tick data for the Apple Inc. (AAPL.O) stock. 

(3) Shows the last five rows of tick data. 


The Eikon Data APIs give access not only to structured financial data, such as histori- 
cal price data, but also to unstructured data such as news articles. The next example 
retrieves metadata for a small selection of news articles and shows the beginning of 
one of the articles as full text: 


In [30]: news = ek.get_news_headlines('R:AAPL.O Language:LEN',
                                      date_from='2018-05-01',
                                      date_to='2018-06-29',
                                      count=7) (1)

In [31]: news (1)
Out[31]:
                                  versionCreated \
2018-06-28 23:00:00.000  2018-06-28 23:00:00.000
2018-06-28 21:23:26.526  2018-06-28 21:23:26.526
2018-06-28 19:48:32.627  2018-06-28 19:48:32.627
2018-06-28 17:33:10.306  2018-06-28 17:33:10.306
2018-06-28 17:33:07.033  2018-06-28 17:33:07.033
2018-06-28 17:31:44.960  2018-06-28 17:31:44.960
2018-06-28 17:00:00.000  2018-06-28 17:00:00.000

                                                                      text \
2018-06-28 23:00:00.000  RPT-FOCUS-AI ambulances and robot doctors: Chi...
2018-06-28 21:23:26.526  Why Investors Should Love Apple's (AAPL) TV En...
2018-06-28 19:48:32.627  Reuters Insider - Trump: We're reclaiming our ...
2018-06-28 17:33:10.306  Apple v. Samsung ends not with a whimper but a...
2018-06-28 17:33:07.033  Apple's trade-war discount extended for anothe...
2018-06-28 17:31:44.960  Other Products: Apple's fast-growing island of...
2018-06-28 17:00:00.000  Pokemon Go creator plans to sell the tech behi...

                                                              storyId \
2018-06-28 23:00:00.000  urn:newsml:reuters.com:20180628:nL4N1TU4F8:6
2018-06-28 21:23:26.526  urn:newsml:reuters.com:20180628:nNRA6e2vft:
2018-06-28 19:48:32.627  urn:newsml:reuters.com:20180628:nRTV1vNwip:
2018-06-28 17:33:10.306  urn:newsml:reuters.com:20180628:nNRAG6eloza:
2018-06-28 17:33:07.033  urn:newsml:reuters.com:20180628:nNRA6e1pmv:
2018-06-28 17:31:44.960  urn:newsml:reuters.com:20180628:nNRA6e1m3n:
2018-06-28 17:00:00.000  urn:newsml:reuters.com:20180628:nL1N1TUOPC:

                        sourceCode
2018-06-28 23:00:00.000    NS:RTRS
2018-06-28 21:23:26.526  NS:ZACKSC
2018-06-28 19:48:32.627    NS:CNBC
2018-06-28 17:33:10.306  NS:WALLST
2018-06-28 17:33:07.033  NS:WALLST
2018-06-28 17:31:44.960  NS:WALLST
2018-06-28 17:00:00.000    NS:RTRS


In [32]: story_html = ek.get_news_story(news.iloc[1, 2]) (2)

In [33]: from bs4 import BeautifulSoup (3)

In [34]: story = BeautifulSoup(story_html, 'html5lib').get_text() (4)

In [35]: print(story[83:958]) (5)

Jun 28, 2018 For years, investors and Apple AAPL have been beholden to 
the iPhone, which is hardly a negative since its flagship product is 
largely responsible for turning Apple into one of the world's biggest 
companies. But Apple has slowly pushed into new growth areas, with 
streaming television its newest frontier. So let's take a look at what 
Apple has planned as it readies itself to compete against the likes of 
Netflix NFLX and Amazon AMZN in the battle for the new age of 
entertainment.Apple's second-quarter revenues jumped by 16% to reach 
$61.14 billion, with iPhone revenues up 14%. However, iPhone unit sales 
climbed only 3% and iPhone revenues accounted for over 62% of total Q2 
sales. Apple knows this is not a sustainable business model, because 
rare is the consumer product that can remain in vogue for decades. This 
is why Apple has made a big push into news, 


(1) Retrieves metadata for a small selection of news articles.
(2) Retrieves the full text of a single article, delivered as an HTML document.
(3) Imports the BeautifulSoup HTML parsing package and ...
(4) ... extracts the contents as plain text (a str object).
(5) Prints the beginning of the news article.


Although just scratching the surface, these two examples illustrate that structured 
and unstructured historical financial data is available in a standardized, efficient way 
via Python wrapper packages and data subscription services. In many circumstances, 
similar data sets can be accessed for free even by individuals who make use of, for 
instance, trading platforms such as the one by FXCM Group, LLC, that is introduced 
in Chapter 14 and also used in Chapter 16. Once the data is on the Python level, 
independent of the original source, the full power of the Python data analytics 
ecosystem can be harnessed. 
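
As a small illustration of what working with such data on the Python level can 
look like, the following sketch aggregates tick data into one-minute bars with 
pandas. The ticks are randomly generated stand-ins (an assumption, since the Eikon 
data requires a subscription); only the column names VALUE and VOLUME are taken 
from the output shown earlier.

```python
import numpy as np
import pandas as pd

# Synthetic ticks standing in for the Eikon result (assumption, not real data).
rng = np.random.RandomState(0)
n = 1000
offsets = np.sort(rng.uniform(0, 3600, n))  # seconds within one hour
index = pd.Timestamp('2018-10-18 16:00:00') + pd.to_timedelta(offsets, unit='s')
ticks = pd.DataFrame({'VALUE': 217 + rng.standard_normal(n).cumsum() * 0.01,
                      'VOLUME': rng.randint(1, 500, n).astype(float)},
                     index=index)

# One-minute bars: open/high/low/close of the price plus the summed volume.
bars = ticks['VALUE'].resample('1min').ohlc()
bars['VOLUME'] = ticks['VOLUME'].resample('1min').sum()
print(bars.head())
```

The same two resample calls should apply unchanged to a DataFrame with a 
DatetimeIndex retrieved from any other source.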


Data-Driven Finance 


Data is what drives finance these days. Even some of the largest 
and often most successful hedge funds call themselves “data- 
driven” instead of “finance-driven.” More and more offerings are 
making huge amounts of data available to large and small institu- 
tions and individuals. Python is generally the programming lan- 
guage of choice to interact with the APIs and to process and 
analyze the data. 


AI-First Finance 


With the availability of large amounts of financial data via programmatic APIs, it has 
become much easier and more fruitful to apply methods from artificial intelligence 
(AI) in general and from machine and deep learning (ML, DL) in particular to finan- 
cial problems, such as in algorithmic trading. 


Python can be considered a first-class citizen in the AI world as well. It is often the 
programming language of choice for AI researchers and practitioners alike. In that 
sense, the financial domain benefits from developments in diverse fields, sometimes 
not even remotely connected to finance. As one example, consider the TensorFlow 
open source package for deep learning, which is developed and maintained by Google 
Inc. and used by (among others) its parent company Alphabet Inc. in its efforts to 
build, produce, and sell self-driving cars. 


Although self-driving cars are certainly not even remotely related to the problem of 
trading stock algorithmically, TensorFlow can, for example, be used to predict 
movements in financial markets. Chapter 15 provides a number of examples in this regard. 


One of the most widely used Python packages for ML is scikit-learn. The code that 
follows shows how, in a highly simplified manner, classification algorithms from ML 
can be used to predict the direction of future market price movements and to base an 
algorithmic trading strategy on those predictions. All the details are explained in 
Chapter 15, so the example is rather concise. First, the data import and the 
preparation of the features data (directional lagged log return data): 

In [36]: import numpy as np
         import pandas as pd

In [37]: data = pd.read_csv('../../source/tr_eikon_eod_data.csv',
                            index_col=0, parse_dates=True)
         data = pd.DataFrame(data['AAPL.O']) (1)
         data['Returns'] = np.log(data / data.shift()) (2)
         data.dropna(inplace=True)

In [38]: lags = 6

In [39]: cols = []
         for lag in range(1, lags + 1):
             col = 'lag_{}'.format(lag)
             data[col] = np.sign(data['Returns'].shift(lag)) (3)
             cols.append(col)
         data.dropna(inplace=True)

(1) Selects historical end-of-day data for the Apple Inc. stock (AAPL.O).
(2) Calculates the log returns over the complete history.
(3) Generates DataFrame columns with directional lagged log return data (+1 or -1).
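
To make the feature construction more tangible, the following sketch applies the 
same steps to a short, made-up price series. The prices, the column name price, and 
the use of only two lags are assumptions chosen to keep the output small; they are 
not taken from the data file used in the text.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices; not taken from tr_eikon_eod_data.csv.
prices = pd.Series([100.0, 101.0, 100.5, 102.0, 101.0, 103.0], name='price')

toy = pd.DataFrame(prices)
# Log return from one row to the next; the first row has no predecessor.
toy['Returns'] = np.log(toy['price'] / toy['price'].shift())

lags = 2  # fewer lags than in the text, to keep the table small
cols = []
for lag in range(1, lags + 1):
    col = 'lag_{}'.format(lag)
    # Direction (+1 or -1) of the return observed `lag` rows earlier.
    toy[col] = np.sign(toy['Returns'].shift(lag))
    cols.append(col)

# Rows without a full set of lagged values are dropped.
toy.dropna(inplace=True)
print(toy[['Returns'] + cols])
```

Each remaining row pairs the current return with the directions of the returns that 
preceded it, which is the information the classifier is given to work with.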


Next, the instantiation of a model object for a support vector machine (SVM) algo- 
rithm, the fitting of the model, and the prediction step. Figure 1-2 shows that the 
prediction-based trading strategy, going long or short on Apple Inc. stock depending 
on the prediction, outperforms the passive benchmark investment in the stock itself: 


In [40]: from sklearn.svm import SVC

In [41]: model = SVC(gamma='auto') (1)

In [42]: model.fit(data[cols], np.sign(data['Returns'])) (2)
Out[42]: SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
           decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
           max_iter=-1, probability=False, random_state=None, shrinking=True,
           tol=0.001, verbose=False)

In [43]: data['Prediction'] = model.predict(data[cols]) (3)

In [44]: data['Strategy'] = data['Prediction'] * data['Returns'] (4)

In [45]: data[['Returns', 'Strategy']].cumsum().apply(np.exp).plot(
             figsize=(10, 6)); (5)

(1) Instantiates the model object.
(2) Fits the model, given the features and the label data (all directional).
(3) Uses the fitted model to create the predictions (in-sample), which are the
    positions of the trading strategy at the same time (long or short).
(4) Calculates the log returns of the trading strategy given the prediction values and
    the benchmark log returns.
(5) Plots the performance of the ML-based trading strategy compared to the
    performance of the passive benchmark investment.

Figure 1-2. ML-based algorithmic trading strategy vs. passive benchmark investment in 
Apple Inc. stock 


The simplified approach taken here does not account for transaction costs, nor does 
it separate the data set into training and testing subsets. However, it shows how 
straightforward the application of ML algorithms to financial data is, at least in a 
technical sense; practically, a number of important topics need to be considered (see 
López de Prado (2018)). 
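
The two omissions named above can be sketched as follows. This is a minimal 
illustration under stated assumptions, not the approach of Chapter 15: the returns 
are randomly generated, the 70/30 split ratio is arbitrary, and the proportional cost 
of 0.0005 per unit of position change is a hypothetical value.

```python
import numpy as np
import pandas as pd
from sklearn.svm import SVC

# Synthetic log returns stand in for real data (assumption for illustration).
rng = np.random.RandomState(100)
data = pd.DataFrame({'Returns': rng.standard_normal(500) * 0.01})

lags = 6
cols = []
for lag in range(1, lags + 1):
    col = 'lag_{}'.format(lag)
    data[col] = np.sign(data['Returns'].shift(lag))
    cols.append(col)
data.dropna(inplace=True)

# Chronological split: fit on the first 70%, evaluate on the remainder.
split = int(len(data) * 0.7)
train, test = data.iloc[:split], data.iloc[split:].copy()

model = SVC(gamma='auto')
model.fit(train[cols], np.sign(train['Returns']))

# Out-of-sample positions (+1 long, -1 short).
test['Prediction'] = model.predict(test[cols])

# Hypothetical proportional cost per unit of position change. A switch between
# long and short changes the position by two units and is charged twice.
# Entering the very first position is not charged in this sketch.
cost = 0.0005
trades = test['Prediction'].diff().abs().fillna(0)
# Subtracting the cost from the log return is an approximation.
test['Strategy'] = test['Prediction'] * test['Returns'] - trades * cost

print(test[['Returns', 'Strategy']].sum().apply(np.exp))
```

Since the returns here are pure noise, no systematic outperformance should be 
expected; the point is only the mechanics of separating fitting from evaluation and 
of charging for position changes.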


AI-First Finance 


AI will reshape finance in a way that other fields have been resha- 
ped already. The availability of large amounts of financial data via 
programmatic APIs functions as an enabler in this context. Basic 
methods from AI, ML, and DL are introduced in Chapter 13 and 
applied to algorithmic trading in Chapters 15 and 16. A proper 
treatment of AI-first finance, however, would require a book fully 
dedicated to the topic. 


AI in finance, as a natural extension of data-driven finance, is certainly a fascinating 
and exciting field, both from a research and a practitioner’s point of view. Although 
this book uses several methods from AI, ML, and DL in different contexts, overall the 
focus lies, in line with the subtitle of the book, on the fundamental Python techniques 
and approaches needed for data-driven finance. These are, however, equally 
important for AI-first finance. 


Conclusion 


Python as a language, and even more so as an ecosystem, is an ideal technological 
framework for the financial industry as a whole and the individual working in finance 
alike. It is characterized by a number of benefits, like an elegant syntax, efficient 
development approaches, and usability for prototyping as well as production. With 
its huge number of available packages, libraries, and tools, Python seems to have 
answers to most questions raised by recent developments in the financial industry in 
terms of analytics, data volumes and frequency, compliance and regulation, as well as 
technology itself. It has the potential to provide a single, powerful, consistent frame- 
work with which to streamline end-to-end development and production efforts even 
across larger financial institutions. 


In addition, Python has become the programming language of choice for artificial 
intelligence in general and machine and deep learning in particular. Python is 
therefore the right language for data-driven finance as well as for AI-first finance, two 
recent trends that are about to reshape finance and the financial industry in 
fundamental ways. 


Further Resources 


The following books cover several aspects only touched upon in this chapter in more 
detail (e.g., Python tools, derivatives analytics, machine learning in general, and 
machine learning in finance): 


• Hilpisch, Yves (2015). Derivatives Analytics with Python. Chichester, England: 
  Wiley Finance. 

• López de Prado, Marcos (2018). Advances in Financial Machine Learning. 
  Hoboken, NJ: John Wiley & Sons. 

• VanderPlas, Jake (2016). Python Data Science Handbook. Sebastopol, CA: 
  O'Reilly. 


When it comes to algorithmic trading, the author’s company offers a range of online 
training programs that focus on Python and other tools and techniques required in 
this rapidly growing field: 


• http://pyalgo.tpq.io 
• http://certificate.tpq.io 


Sources referenced in this chapter are, among others, the following: 


• Ding, Cubillas (2010). “Optimizing the OTC Pricing and Valuation 
  Infrastructure.” Celent. 

• Lewis, Michael (2014). Flash Boys. New York: W. W. Norton & Company. 

• Patterson, Scott (2010). The Quants. New York: Crown Business. 


CHAPTER 2 
Python Infrastructure 


In building a house, there is the problem of the selection of wood. 


It is essential that the carpenter’s aim be to carry equipment that will cut well and, 
when he has time, to sharpen that equipment. 


—Miyamoto Musashi (The Book of Five Rings) 


For someone new to Python, Python deployment might seem anything but straightforward. 
The same holds true for the wealth of libraries and packages that can be installed 
optionally. First of all, there is not only one Python. Python comes in many different 
flavors, like CPython, Jython, IronPython, and PyPy. Then there is the divide 
between Python 2.7 and the 3.x world.¹ 


Even after you've decided on a version, deployment is difficult for a number of addi- 
tional reasons: 


• The interpreter (a standard CPython installation) only comes with the so-called 
  standard library (e.g., covering typical mathematical functions) 

• Optional Python packages need to be installed separately, and there are 
  hundreds of them 

• Compiling/building such nonstandard packages on your own can be tricky due 
  to dependencies and operating system-specific requirements 

• Taking care of these dependencies and of version consistency over time (i.e., 
  maintenance) is often tedious and time consuming 

• Updates and upgrades for certain packages might necessitate recompiling a 
  multitude of other packages 

• Changing or replacing one package might cause trouble in (many) other places 


1 This edition is based on version 3.7 (the latest major release at the time of writing) of CPython, the original 
and most popular version of the Python programming language. 


Fortunately, there are tools and strategies available that can help. This chapter covers 
the following types of technologies that help with Python deployment: 


Package managers 
Package managers like pip and conda help with the installing, updating, and 
removing of Python packages; they also help with version consistency of differ- 
ent packages. 


Virtual environment managers 
A virtual environment manager like virtualenv or conda allows you to manage 
multiple Python installations in parallel (e.g., to have both a Python 2.7 and 3.7 
install on a single machine or to test the most recent development version of a 
fancy Python package without risk).² 


Containers 
Docker containers represent complete filesystems containing all the pieces of a 
system needed to run certain software, like code, runtime, or system tools. For 
example, you can run an Ubuntu 18.04 operating system with a Python 3.7 install 
and the respective Python code in a Docker container hosted on a machine run- 
ning macOS or Windows 10. 


Cloud instances 

Deploying Python code for financial applications generally requires high availa- 
bility, security, and also performance; these requirements can typically only be 
met by the use of professional compute and storage infrastructure that is nowa- 
days available at attractive conditions in the form of fairly small to really large 
and powerful cloud instances. One benefit of a cloud instance (i.e., a virtual 
server) compared to a dedicated server rented longer-term is that users generally 
get charged only for the hours of actual usage; another advantage is that such 
cloud instances are available literally in a minute or two if needed, which helps 
with agile development and also with scalability. 


The structure of this chapter is as follows: 


“conda as a Package Manager” 
This section introduces conda as a package manager for Python. 


“conda as a Virtual Environment Manager” 
This section focuses on conda’s capabilities as a virtual environment manager. 


“Using Docker Containers” 
This section gives a brief overview of Docker as a containerization technology 
and focuses on the building of an Ubuntu-based container with a Python 3.7 
installation. 


“Using Cloud Instances” 
The section shows how to deploy Python and Jupyter Notebook, a powerful, 
browser-based tool suite for Python development, in the cloud. 


2 A recent project called pipenv combines the capabilities of the package manager pip with those of the virtual 
environment manager virtualenv. 


The goal of this chapter is to set up a proper Python installation with the most impor- 
tant tools as well as numerical, data analysis, and visualization packages on a profes- 
sional infrastructure. This combination then serves as the backbone for 
implementing and deploying the Python code in later chapters, be it interactive 
financial analytics code or code in the form of scripts and modules. 


conda as a Package Manager 


Although conda can be installed standalone, an efficient way of doing it is via Mini- 
conda, a minimal Python distribution including conda as a package and virtual envi- 
ronment manager. 


Installing Miniconda 


Miniconda is available for Windows, macOS, and Linux. You can download the dif- 
ferent versions from the Miniconda webpage. In what follows, the Python 3.7 64-bit 
version is assumed. The main example in this section is a session in an Ubuntu-based 
Docker container which downloads the Linux 64-bit installer via wget and then 
installs Miniconda. The code as shown should work—perhaps with minor modifica- 
tions—on any other Linux- or macOS-based machine as well: 


$ docker run -ti -h py4fi -p 11111:11111 ubuntu:latest /bin/bash 
root@py4fi:/# apt-get update; apt-get upgrade -y 
root@py4fi:/# apt-get install -y bzip2 gcc wget 
root@py4fi:/# cd root 
root@py4fi:~# wget \ 
> https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh \ 
> -O miniconda.sh 

HTTP request sent, awaiting response... 200 OK 
Length: 62574861 (60M) [application/x-sh] 
Saving to: 'miniconda.sh' 

miniconda.sh 100%[====================>] 59.68M 5.97MB/s in 11s 

2018-09-15 09:44:28 (5.42 MB/s) - 'miniconda.sh' saved [62574861/62574861] 

root@py4fi:~# bash miniconda.sh 

Welcome to Miniconda3 4.5.11 

In order to continue the installation process, please review the license 
agreement. 


Please, press ENTER to continue 
>>> 


Simply pressing the Enter key starts the installation process. After reviewing the 
license agreement, approve the terms by answering yes: 


Do you accept the license terms? [yes|no] 
[no] >>> yes 


Miniconda3 will now be installed into this location: 
/root/miniconda3 


- Press ENTER to confirm the location 
- Press CTRL-C to abort the installation 
- Or specify a different location below 


[/root/miniconda3] >>> 
PREFIX=/root/miniconda3 
installing: python-3.7. 


installing: requests-2.19.1-py37_0 ... 
installing: conda-4.5.11-py37_0 ... 
installation finished. 


After you have agreed to the licensing terms and have confirmed the install location, 
you should allow Miniconda to prepend the new Miniconda install location to the 
PATH environment variable by answering yes once again: 


Do you wish the installer to prepend the Miniconda3 install location 
to PATH in your /root/.bashrc ? [yes|no] 
[no] >>> yes 


Appending source /root/miniconda3/bin/activate to /root/.bashrc 
A backup will be made to: /root/.bashrc-miniconda3.bak 
For this change to become active, you have to open a new terminal. 


Thank you for installing Miniconda3! 
root@py4fi:~# 


After that, you might want to upgrade conda as well as Python:³ 


root@py4fi:~# export PATH="/root/miniconda3/bin/:$PATH" 
root@py4fi:~# conda update -y conda python 
root@py4fi:~# echo ". /root/miniconda3/etc/profile.d/conda.sh" >> ~/.bashrc 
root@py4fi:~# bash 


After this rather simple installation procedure, you'll have a basic Python install as 
well as conda available. The basic Python install comes with some nice batteries 
included, like the SQLite3 database engine. You might try out whether you can start 
Python in a new shell instance after appending the relevant path to the respective 
environment variable (as done previously): 


root@py4fi:~# python 
Python 3.7.0 (default, Jun 28 2018, 13:15:42) 
[GCC 7.2.0] :: Anaconda, Inc. on linux 
Type "help", "copyright", "credits" or "license" for more information. 
>>> print('Hello Python for Finance World.') 
Hello Python for Finance World. 
>>> exit() 
root@py4fi:~# 
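
A slightly more thorough check of the fresh installation might look like the 
following sketch, which reports the running interpreter and exercises the bundled 
SQLite3 engine. The table name quotes and the inserted row are made up for 
illustration.

```python
import sqlite3
import sys

# Report which interpreter is running and where it is installed.
print(sys.version_info[:3], sys.executable)

# SQLite ships with the standard library; an in-memory database needs no file.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE quotes (symbol TEXT, price REAL)')
con.execute('INSERT INTO quotes VALUES (?, ?)', ('AAPL.O', 217.13))
rows = con.execute('SELECT symbol, price FROM quotes').fetchall()
con.close()
print(rows)
```

If the path printed for the executable points into the Miniconda folder, the shell 
is picking up the new installation as intended.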


Basic Operations with conda 


conda can be used to efficiently handle, among other things, the installing, updating, 
and removing of Python packages. The following list provides an overview of the 
major functions: 


Installing Python x.x 
conda install python=x.x 


Updating Python 
conda update python 


Installing a package 
conda install $PACKAGE_NAME 


Updating a package 
conda update $PACKAGE_NAME 


Removing a package 
conda remove $PACKAGE_NAME 


Updating conda itself 
conda update conda 


Searching for packages 
conda search $SEARCH_TERM 


Listing installed packages 
conda list 


3 The Miniconda installer is in general not as regularly updated as conda and Python themselves. 


Given these capabilities, installing, for example, NumPy, one of the most important 
libraries of the so-called scientific stack, requires a single command only. When the 
installation takes place on a machine with an Intel processor, the procedure 
automatically installs the Intel Math Kernel Library (mkl), which speeds up numerical 
operations not only for NumPy but also for a few other scientific Python packages:⁴ 


root@py4fi:~# conda install numpy 
Solving environment: done 


## Package Plan ## 
environment location: /root/miniconda3 
added / updated specs: 


- numpy 


The following packages will be downloaded: 


    package                    |            build
    ---------------------------|-----------------
    mkl-2019.0                 |              117       204.4 MB
    intel-openmp-2019.0        |              117         721 KB
    mkl_random-1.0.1           |   py37h4414c95_1         372 KB
    libgfortran-ng-7.3.0       |       hdf63c60_0         1.3 MB
    numpy-1.15.1               |   py37h1d66e8a_0          37 KB
    numpy-base-1.15.1          |   py37h81de0dd_0         4.2 MB
    blas-1.0                   |              mkl           6 KB
    mkl_fft-1.0.4              |   py37h4414c95_1         149 KB
                                           Total:       211.1 MB

The following NEW packages will be INSTALLED:

    blas:           1.0-mkl
    intel-openmp:   2019.0-117
    libgfortran-ng: 7.3.0-hdf63c60_0
    mkl:            2019.0-117
    mkl_fft:        1.0.4-py37h4414c95_1
    mkl_random:     1.0.1-py37h4414c95_1
    numpy:          1.15.1-py37h1d66e8a_0
    numpy-base:     1.15.1-py37h81de0dd_0

Proceed ([y]/n)? y

Downloading and Extracting Packages
mkl-2019.0        | 204.4 MB | ############################### | 100%
numpy-1.15.1      | 37 KB    | ############################### | 100%
numpy-base-1.15.1 | 4.2 MB   | ############################### | 100%

root@py4fi:~# 


4 Installing the metapackage nomkl, e.g. with conda install numpy nomkl, avoids the automatic installation 
and usage of mkl and related other packages. 


Multiple packages can also be installed at once. The -y flag indicates that all (poten- 
tial) questions shall be answered with yes: 


root@py4fi:/# conda install -y ipython matplotlib pandas pytables scikit-learn \ 
> scipy 
pytables-3.4.4   | 1.5 MB  | ############################### | 100%
kiwisolver-1.0.1 | 83 KB   | ############################### | 100%
icu-58.2         | 22.5 MB | ############################### | 100%
Preparing transaction: done 
Verifying transaction: done 
Executing transaction: done 
root@py4fi:~# 

After the resulting installation procedure, some of the most important libraries for 
financial analytics are available in addition to the standard ones. These include: 


IPython 
An improved interactive Python shell 


matplotlib 
The standard plotting library in Python 


NumPy 
For efficient handling of numerical arrays 


pandas 
For management of tabular data, like financial time series data 


PyTables 
A Python wrapper for the HDF5 library 


scikit-learn 
A package for machine learning and related tasks 


SciPy 
A collection of scientific classes and functions (installed as a dependency) 


This provides a basic tool set for data analysis in general and financial analytics in 
particular. The next example uses IPython and draws a set of pseudo-random numbers 
with NumPy: 


root@py4fi:~# ipython 
Python 3.7.0 (default, Jun 28 2018, 13:15:42) 
Type 'copyright', 'credits' or 'license' for more information 
IPython 6.5.0 -- An enhanced Interactive Python. Type '?' for help. 

In [1]: import numpy as np 

In [2]: np.random.seed(100) 

In [3]: np.random.standard_normal((5, 4)) 
Out[3]: 
array([[-1.74976547,  0.3426804 ,  1.1530358 , -0.25243604],
       [ 0.98132079,  0.51421884,  0.22117967, -1.07004333],
       [-0.18949583,  0.25500144, -0.45802699,  0.43516349],
       [-0.58359505,  0.81684707,  0.67272081, -0.10441114],
       [-0.53128038,  1.02973269, -0.43813562, -1.11831825]])

In [4]: exit 
root@py4fi:~# 

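
The call to np.random.seed() in the session above fixes the state of the random 
number generator, which is why the numbers shown can be reproduced. The following 
short sketch, which is not part of the original session, makes this explicit by 
drawing the same set of numbers twice:

```python
import numpy as np

np.random.seed(100)
first = np.random.standard_normal((5, 4))

np.random.seed(100)  # resetting the seed restarts the same sequence
second = np.random.standard_normal((5, 4))

# Both draws contain exactly the same values.
print(np.array_equal(first, second))
```

Without the second call to np.random.seed(), the generator would simply continue 
its sequence and the two arrays would differ.
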
Executing conda list shows which packages are installed: 


root@py4fi:~# conda list 
# packages in environment at /root/miniconda3: 


# 

# Name                    Version                   Build  Channel
asn1crypto                0.24.0                   py37_0
backcall                  0.1.0                    py37_0
blas                      1.0                         mkl
blosc                     1.14.4               hdbcaa40_0
bzip2                     1.0.6                h14c3975_5
python                    3.7.0                hc3d631a_0
wheel                     0.31.1                   py37_0
xz                        5.2.4                h14c3975_4
yaml                      0.1.7                had09818_2
zlib                      1.2.11               ha838bed_2


root@py4fi:~# 
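
A similar overview can be obtained from within Python itself. The sketch below 
relies on importlib.metadata, which is part of the standard library only from 
Python 3.8 onward (an assumption that goes beyond the Python 3.7 setup used here). 
Unlike conda list, it only sees Python distributions, so non-Python packages such 
as bzip2 or zlib do not appear.

```python
from importlib import metadata  # standard library from Python 3.8 onward

# Collect name and version of every distribution visible to this interpreter.
installed = sorted(
    (dist.metadata['Name'] or '', str(dist.version))
    for dist in metadata.distributions()
)

# Print the first few entries in a layout loosely resembling conda list.
for name, version in installed[:10]:
    print('{:<25}{}'.format(name, version))
```
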
If a package is not needed anymore, it is efficiently removed with conda remove: 


root@py4fi:~# conda remove scikit-learn 
Solving environment: done 


## Package Plan ## 


environment location: /root/miniconda3 

removed specs: 
- scikit-learn 


The following packages will be REMOVED: 
scikit-learn: 0.19.1-py37hedc7406_0 
Proceed ([y]/n)? y 


Preparing transaction: done 

Verifying transaction: done 

Executing transaction: done 

root@py4fi:~# 
conda as a package manager is already quite useful. However, its full power only 
becomes evident when adding virtual environment management to the mix. 


Easy Package Management 


Using conda as a package manager makes installing, updating, and 
removing Python packages a pleasant experience. There is no need 
to take care of building and compiling packages on your own— 
which can be tricky sometimes, given the list of dependencies a 
package specifies and the specifics to be considered on different 
operating systems. 


conda as a Virtual Environment Manager 


Depending on the version of the installer you choose, Miniconda provides a default 
Python 2.7 or 3.7 installation. The virtual environment management capabilities of 
conda allow one, for example, to add to a Python 3.7 default installation a completely 
separate installation of Python 2.7.x. To this end, conda offers the following function- 
ality: 


Creating a virtual environment 
conda create --name $ENVIRONMENT_NAME 


Activating an environment 
conda activate $ENVIRONMENT_NAME 


Deactivating an environment 
conda deactivate 


Removing an environment 
conda env remove --name $ENVIRONMENT_NAME 


Exporting to an environment file 
conda env export > $FILE_NAME 


Creating an environment from a file 
conda env create -f $FILE_NAME 


Listing all environments 
conda info --envs 


As a simple illustration, the example code that follows creates an environment called 
py27, installs IPython, and executes a line of Python 2.7.x code: 


root@py4fi:~# conda create --name py27 python=2.7 
Solving environment: done 


## Package Plan ## 

environment location: /root/miniconda3/envs/py27 

added / updated specs: 
- python=2.7 

The following NEW packages will be INSTALLED: 

ca-certificates: 2018.03.07-0 
python: 2.7.15-h1571d57_0 
zlib: 1.2.11-ha838bed_2 

Proceed ([y]/n)? y 

Preparing transaction: done 
Verifying transaction: done 
Executing transaction: done 
#
# To activate this environment, use:
# > conda activate py27
#
# To deactivate an active environment, use:
# > conda deactivate
#


root@py4fi:~# 


Notice how the prompt changes to include (py27) after the activation of the 
environment: 


root@py4fi:~# conda activate py27 
(py27) root@py4fi:~# conda install ipython 
Solving environment: done 
Executing transaction: done 
(py27) root@py4fi:~# 


Finally, this allows you to use IPython with Python 2.7 syntax: 


(py27) root@py4fi:~# ipython 
Python 2.7.15 |Anaconda, Inc.| (default, May 1 2018, 23:32:55) 
Type "copyright", "credits" or "license" for more information. 


IPython 5.8.0 -- An enhanced Interactive Python. 


? -> Introduction and overview of IPython's features. 
%quickref -> Quick reference. 

help -> Python's own help system. 

object? -> Details about 'object', use 'object??' for extra details. 


In [1]: print "Hello Python for Finance World!" 
Hello Python for Finance World! 


In [2]: exit 

(py27) root@py4fi:~# 
As this example demonstrates, using conda as a virtual environment manager allows 
you to install different Python versions alongside each other. It also allows you to 
install different versions of certain packages. The default Python install is not influ- 
enced by such a procedure, nor are other environments which might exist on the 
same machine. All available environments can be shown via conda env list: 


(py27) root@py4fi:~# conda env list 
# conda environments: 


# 
base /root/miniconda3 
py27 * /root/miniconda3/envs/py27 


(py27) root@py4fi:~# 
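
From within a running Python session, the active environment can be identified as 
well. The following sketch assumes that the environment was activated with conda 
activate, which sets the CONDA_DEFAULT_ENV environment variable; outside of conda 
the variable is simply missing.

```python
import os
import sys

# conda sets CONDA_DEFAULT_ENV on activation; it is absent outside conda.
env_name = os.environ.get('CONDA_DEFAULT_ENV', '(no conda environment active)')

# sys.prefix is the root folder of the running Python installation,
# e.g. /root/miniconda3/envs/py27 for the environment created above.
print(env_name)
print(sys.prefix)
```
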


Sometimes it is necessary to share environment information with others or to use 
environment information on multiple machines. To this end, one can export the 
installed packages list to a file with conda env export. By default, this only works 
properly if the machines use the same operating system, since the build versions are 
specified in the resulting YAML file. The --no-builds flag omits them so that only the 
package versions are specified: 


(py27) root@py4fi:~# conda env export --no-builds > py27env.yml 
(py27) root@py4fi:~# cat py27env.yml 
name: py27 
channels: 
- defaults 
dependencies: 
- backports=1.0 


- python=2.7.15 
- zlib=1.2.11 
prefix: /root/miniconda3/envs/py27 


(py27) root@py4fi:~# 
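
To check what such a file pins, the dependency list can be read back in Python. 
The sketch below is deliberately simplistic: the helper parse_dependencies is 
hypothetical, it works line by line on text shaped like the export above (abridged 
here), and it is no substitute for a real YAML parser.

```python
# Text in the shape of a `conda env export --no-builds` file (abridged).
exported = """\
name: py27
channels:
  - defaults
dependencies:
  - python=2.7.15
  - zlib=1.2.11
prefix: /root/miniconda3/envs/py27
"""


def parse_dependencies(text):
    """Return {package: version} from the dependencies list.

    Line-based and simplistic: nested sections (e.g., pip) are not handled.
    """
    deps, in_deps = {}, False
    for line in text.splitlines():
        stripped = line.strip()
        if not line.startswith((' ', '-')):
            # A top-level key starts or ends the dependencies section.
            in_deps = stripped == 'dependencies:'
        elif in_deps and stripped.startswith('- '):
            name, _, version = stripped[2:].partition('=')
            deps[name] = version
    return deps


print(parse_dependencies(exported))
```
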


Often a virtual environment, which is technically not that much more than a certain 
(sub)folder structure, is created to do some quick tests.⁵ In such a case, the 
environment is easily removed after deactivation via conda env remove: 


(py27) root@py4fi:/# conda deactivate 
root@py4fi:~# conda env remove -y --name py27 


Remove all packages in environment /root/miniconda3/envs/py27: 


## Package Plan ## 


environment location: /root/miniconda3/envs/py27 


The following packages will be REMOVED: 
backports: 1.0-py27_1 
zlib: 1.2.11-ha838bed_2 


root@py4fi:~# 
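
That a virtual environment is little more than a folder structure can be seen with 
a few lines of Python. Note that the sketch uses the venv module from the standard 
library rather than conda (a different tool, chosen here only because it can be 
called directly from Python); the folder name demo_env is arbitrary.

```python
import os
import tempfile
import venv

# Create a throwaway environment in a temporary folder and list its contents.
with tempfile.TemporaryDirectory() as tmp:
    env_dir = os.path.join(tmp, 'demo_env')
    # with_pip=False keeps the example fast and offline.
    venv.create(env_dir, with_pip=False)
    contents = sorted(os.listdir(env_dir))
    print(contents)
```

The listing typically shows a handful of subfolders plus a pyvenv.cfg file, and 
removing the folder removes the environment.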


This concludes the overview of conda as a virtual environment manager. 


Easy Environment Management 


conda does not only help with managing packages; it is also a vir- 
tual environment manager for Python. It simplifies the creation of 
different Python environments, allowing you to have multiple ver- 
sions of Python and optional packages available on the same 
machine without them influencing each other in any way. conda 
also allows you to export environment information so you can 
easily replicate it on multiple machines or share it with others. 


5 In the official documentation you find the following explanation: “Python ‘Virtual Environments’ allow 
Python packages to be installed in an isolated location for a particular application, rather than being installed 
globally.” 
Using Docker Containers


Docker containers have taken the IT world by storm. Although the technology is still 
relatively young, it has established itself as one of the benchmarks for the efficient 
development and deployment of almost any kind of software application. 


For the purposes of this book it suffices to think of a Docker container as a separate 
(“containerized”) filesystem that includes an operating system (e.g., Ubuntu Server 
18.04), a (Python) runtime, additional system and development tools, as well as fur- 
ther (Python) libraries and packages as needed. Such a Docker container might run 
on a local machine with Windows 10 or on a cloud instance with a Linux operating 
system, for instance. 


This section does not go into all the exciting details of Docker containers. It is rather 
a concise illustration of what the Docker technology can do in the context of Python 
deployment.⁶


Docker Images and Containers 


However, before moving on to the illustration, two fundamental concepts need to be 
distinguished when talking about Docker. The first is a Docker image, which can be 
compared to a Python class. The second is a Docker container, which can be
compared to an instance of the respective Python class.⁷
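

The class/instance analogy can be sketched in a few lines of Python. The class below is purely illustrative and does not talk to Docker; the class name, attributes, and method are made up. The class plays the role of the image (a fixed blueprint), while each instance plays the role of a container with its own state.

```python
class Py4fiBasic:
    """Plays the role of an image: a blueprint shared by all instances."""

    base = "ubuntu:latest"  # class attributes are the same for every instance
    command = "ipython"

    def __init__(self):
        self.history = []  # instance state, private to one "container"

    def execute(self, line):
        self.history.append(line)


c1 = Py4fiBasic()  # two "containers" created from the same "image"
c2 = Py4fiBasic()
c1.execute("import numpy as np")
print(c1.history, c2.history)
```

Changing the state of one instance leaves the other instance and the class itself untouched, which mirrors the statement that an image has no state while containers do.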


On a more technical level, you find the following definition for an image in the 
Docker glossary: 


Docker images are the basis of containers. An Image is an ordered collection of root 
filesystem changes and the corresponding execution parameters for use within a con- 
tainer runtime. An image typically contains a union of layered filesystems stacked on 
top of each other. An image does not have state and it never changes. 


Similarly, you find the following definition for a container in the Docker glossary, 
which makes the analogy to Python classes and instances of such classes transparent: 


A container is a runtime instance of a Docker image. A Docker container consists of: a
Docker image, an execution environment, and a standard set of instructions.


Depending on the operating system, the installation of Docker is somewhat different.
That is why this section does not go into the details. More information and further
links are found on the About Docker CE page.


6 See Matthias and Kane (2015) for a comprehensive introduction to the Docker technology. 


7 If the terms are not yet clear, they will become so in Chapter 6. 
Building an Ubuntu and Python Docker Image


This section illustrates the building of a Docker image based on the latest version of 
Ubuntu, which includes Miniconda as well as a few important Python packages. In 
addition, it does some Linux housekeeping by updating the Linux packages index, 
upgrading packages if required, and installing certain additional system tools. To this 
end, two scripts are needed. One is a bash script that does all the work on the Linux 
level.⁸ The other is a so-called Dockerfile, which controls the building procedure for
the image itself. 


The bash script in Example 2-1 that does the installing consists of three major parts. 
The first part handles the Linux housekeeping. The second part installs Miniconda, 
while the third part installs optional Python packages. There are also more detailed 
comments inline. 


Example 2-1. Script installing Python and optional packages 
#!/bin/bash
#
# Script to Install
# Linux System Tools and
# Basic Python Components
#
# Python for Finance, 2nd ed.
# (c) Dr. Yves J. Hilpisch
#

# GENERAL LINUX 

apt-get update # updates the package index cache 
apt-get upgrade -y # updates packages 

# installs system tools 

apt-get install -y bzip2 gcc git htop screen vim wget 
apt-get upgrade -y bash # upgrades bash if necessary 
apt-get clean # cleans up the package index cache 


# INSTALL MINICONDA 

# downloads Miniconda 

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O \
Miniconda.sh

bash Miniconda.sh -b # installs it 

rm -rf Miniconda.sh # removes the installer 

export PATH="/root/miniconda3/bin:$PATH" # prepends the new path 


# INSTALL PYTHON LIBRARIES
conda update -y conda python # updates conda & Python (if required)
conda install -y pandas # installs pandas
conda install -y ipython # installs IPython shell


8 Consult Robbins (2016) for a concise introduction to and quick overview of bash scripting. Also see
https://www.gnu.org/software/bash.


The Dockerfile in Example 2-2 uses the bash script in Example 2-1 to build a new
Docker image. It also has its major parts commented inline.


Example 2-2. Dockerfile to build the image 


#
# Building a Docker Image with
# the Latest Ubuntu Version and
# Basic Python Install
#
# Python for Finance, 2nd ed.
# (c) Dr. Yves J. Hilpisch
#


# latest Ubuntu version 
FROM ubuntu:latest


# information about maintainer 
MAINTAINER yves 


# add the bash script 
ADD install.sh / 


# change rights for the script 
RUN chmod u+x /install.sh 


# run the bash script 
RUN /install.sh 


# prepend the new path 
ENV PATH /root/miniconda3/bin:$PATH 


# execute IPython when container is run 
CMD ["ipython"] 


If these two files are in a single folder and Docker is installed, then the building of the 
new Docker image is straightforward. Here, the tag py4fi:basic is used for the 
image. This tag is needed to reference the image, for example when running a con- 
tainer based on it: 


~/Docker$ docker build -t py4fi:basic . 


Removing intermediate container 5fec0c9b2239 
---> accee128d9e9 

Step 6/7 : ENV PATH /root/miniconda3/bin:$PATH 
---> Running in a2bb97686255 
Removing intermediate container a2bb97686255
---> 73b00c215351 

Step 7/7 : CMD ["ipython"] 
---> Running in ec7acd90c991 

Removing intermediate container ec7acd90c991 
---> 6c36b9117cd2

Successfully built 6c36b9117cd2 

Successfully tagged py4fi:basic 

~/Docker$ 


Existing Docker images can be listed via docker images. The new image should be at 
the top of the list: 


(py4fi) ~/Docker$ docker images 


REPOSITORY TAG IMAGE ID CREATED SIZE 
py4fi basic 6c36b9117cd2 About a minute ago 1.79GB 
ubuntu latest cd6d8154f1e1 9 days ago 84.1MB 


(py4fi) ~/Docker$ 


Successfully building the py4fi:basic image allows you to run the respective Docker
container with docker run. The parameter combination -ti is needed for interactive
processes running within a Docker container, like a shell process (see the docker run
reference page):


~/Docker$ docker run -ti py4fi:basic 

Python 3.7.0 (default, Jun 28 2018, 13:15:42) 

Type 'copyright', 'credits' or 'license' for more information 
IPython 6.5.0 -- An enhanced Interactive Python. Type '?' for help. 


In [1]: import numpy as np 

In [2]: a = np.random.standard_normal((5, 3)) 

In [3]: import pandas as pd 

In [4]: df = pd.DataFrame(a, columns=['a', 'b', 'c']) 


In [5]: df
Out[5]:
          a         b         c
0 -1.412661 -0.881592  1.704623
1 -1.294977  0.546676  1.027046
2  1.156361  1.979057  0.989772
3  0.546736 -0.479821  0.693907
4 -1.972943 -0.193964  0.769500


In [6]: 


Exiting IPython will exit the container as well since it is the only application running
within the container. However, you can detach from a container by typing
Ctrl-P+Ctrl-Q.


The docker ps command will still show the running container (and any other
currently running containers) after you've detached from it:


~/Docker$ docker ps 


CONTAINER ID IMAGE COMMAND CREATED STATUS 
e815df8f0f4d py4fi:basic "ipython" About a minute ago Up About a minute 
4518917de7dc ubuntu:latest "/bin/bash" About an hour ago Up About an hour 
d081b5c7add0 ubuntu:latest "/bin/bash" 21 hours ago Up 21 hours 
~/Docker$ 


Attaching to a Docker container is accomplished with the command docker attach
$CONTAINER_ID (notice that a few letters of the $CONTAINER_ID are enough):


~/Docker$ docker attach e815d 


In [6]: df.info() 

<class 'pandas.core.frame.DataFrame'> 
RangeIndex: 5 entries, 0 to 4 

Data columns (total 3 columns): 

a 5 non-null float64 

b 5 non-null float64 

c 5 non-null float64 

dtypes: float64(3) 

memory usage: 200.0 bytes 


In [7]: exit 

~/Docker$


The exit command terminates IPython and stops the Docker container. It can be
removed with docker rm:


~/Docker$ docker rm e815d 
e815d 
~/Docker$ 


Similarly, the Docker image py4fi:basic can be removed via docker rmi if not
needed any longer. While containers are relatively lightweight, single images might
consume quite a bit of storage. In the case of the py4fi:basic image, the size is close
to 2 GB. That is why you might want to regularly clean up the list of Docker images:


~/Docker$ docker rmi 6c36b9117cd2


Of course, there is much more to say about Docker containers and their benefits in 
certain application scenarios. But for the purposes of this book, it’s enough to know 
that they provide a modern approach to deploy Python, to do Python development in 
a completely separate (containerized) environment, and to ship code for algorithmic
trading.
Benefits of Docker Containers


If you are not yet using Docker containers, you should consider 
doing so. They provide a number of benefits when it comes to 
Python deployment and development efforts, not only when work- 
ing locally but in particular when working with remote cloud 
instances and servers deploying code for algorithmic trading. 


Using Cloud Instances 


This section shows how to set up a full-fledged Python infrastructure on a 
DigitalOcean cloud instance. There are many other cloud providers out there, among 
them the leading provider, Amazon Web Services (AWS). However, DigitalOcean is 
well known for its simplicity and also its relatively low rates for its smaller cloud 
instances, called Droplets. The smallest Droplet, which is generally sufficient for 
exploration and development purposes, only costs 5 USD per month or 0.007 USD 
per hour. Usage is charged by the hour so that one can easily spin up a Droplet for 2 
hours, say, destroy it afterward, and get charged just 0.014 USD. 
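

The billing arithmetic can be checked with a short Python sketch. The two rates are the ones quoted above; the assumption that hourly charges are capped at the monthly price, as well as the function name, are made here for illustration and should be verified against the provider's current pricing.

```python
HOURLY_RATE = 0.007  # USD per hour, as quoted in the text
MONTHLY_PRICE = 5.0  # USD per month, as quoted in the text


def droplet_cost(hours):
    """Hourly billing, assumed here to be capped at the monthly price."""
    return min(hours * HOURLY_RATE, MONTHLY_PRICE)


print(round(droplet_cost(2), 3))  # two hours of usage
print(droplet_cost(24 * 31))      # a full 31-day month hits the assumed cap
```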


The goal of this section is to set up a Droplet on DigitalOcean that has a Python 3.7 
installation plus typically needed packages (e.g., NumPy, pandas) in combination with 
a password-protected and Secure Sockets Layer (SSL)-encrypted Jupyter Notebook 
server installation. This server installation will provide three major tools that can be 
used via a regular browser: 


Jupyter Notebook 
A popular interactive development environment that features a selection of dif- 
ferent language kernels (e.g., for Python, R, and Julia). 


Terminal 
A system shell implementation accessible via the browser that allows for all typi- 
cal system administration tasks and for usage of helpful tools like Vim and git. 


Editor 
A browser-based file editor with syntax highlighting for many different program- 
ming languages and file types as well as typical text/code editing capabilities. 


Having Jupyter Notebook installed on a Droplet allows you to do Python develop- 
ment and deployment via the browser, circumventing the need to log in to the cloud 
instance via Secure Shell (SSH) access. 


9 New users who sign up via this referral link get a starting credit of 10 USD for DigitalOcean. 
To accomplish the goal of this section, a number of files are needed:


Server setup script 
This script orchestrates all the steps necessary, like, for instance, copying other 
files to the Droplet and running them on the Droplet. 


Python and Jupyter installation script 
This installs Python, additional packages, and Jupyter Notebook, and starts the 
Jupyter Notebook server. 


Jupyter Notebook configuration file 
This file is for the configuration of the Jupyter Notebook server, e.g., with respect 
to password protection. 


RSA public and private key files 
These two files are needed for the SSL encryption of the Jupyter Notebook server. 


The following subsections work backward through this list of files. 


RSA Public and Private Keys 


In order to create a secure connection to the Jupyter Notebook server via an arbitrary 
browser, an SSL certificate consisting of RSA public and private keys is needed. In 
general, one would expect such a certificate to come from a so-called Certificate 
Authority (CA). For the purposes of this book, however, a self-generated certificate is 
“good enough.”¹⁰ A popular tool to generate RSA key pairs is OpenSSL. The brief
interactive session that follows shows how to generate a certificate appropriate for use 
with a Jupyter Notebook server (insert your own values for the country name and 
other fields after the prompts): 


~/cloud$ openssl req -x509 -nodes -days 365 -newkey \ 
> rsa:1024 -out cert.pem -keyout cert.key 

Generating a 1024 bit RSA private key 

..++++++

.......++++++

writing new private key to 'cert.key' 


You are about to be asked to enter information that will be incorporated into your 
certificate request. What you are about to enter is what is called a Distinguished 
Name or a DN. There are quite a few fields, but you can leave some blank and others 
will have a default value. If you enter ., the field will be left blank. 


Country Name (2 letter code) [AU]:DE
State or Province Name (full name) [Some-State]:Saarland
Locality Name (eg, city) []:Voelklingen
Organization Name (eg, company) [Internet Widgits Pty Ltd]:TPQ GmbH
Organizational Unit Name (eg, section) []:Python for Finance
Common Name (e.g. server FQDN or YOUR name) []:Jupyter
Email Address []:team@tpq.io
~/cloud$ ls
cert.key cert.pem
~/cloud$


The two files cert.key and cert.pem need to be copied to the Droplet and need to be
referenced by the Jupyter Notebook configuration file. This file is presented next.


10 With a self-generated certificate you might need to add a security exception when prompted by the browser.


Jupyter Notebook Configuration File 


A public Jupyter Notebook server can be deployed securely as explained in the docu- 
mentation. Among other features, Jupyter Notebook can be password protected. To 
this end, there is a password hash code-generating function called passwd() available 
in the notebook.auth subpackage. The following code generates a password hash 
code with jupyter being the password itself: 


~/cloud$ ipython 

Python 3.7.0 (default, Jun 28 2018, 13:15:42) 

Type 'copyright', 'credits' or 'license' for more information 
IPython 6.5.0 -- An enhanced Interactive Python. Type '?' for help. 


In [1]: from notebook.auth import passwd 


In [2]: passwd('jupyter') 
Out[2]: 'sha1:d4d34232ac3a:55ea0ffd78cc3299e3e5e6ecc0d36be0935d424b'
In [3]: exit 
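

The resulting string has the form algorithm:salt:digest. The following standard-library sketch illustrates how a salted hash of this shape can be created and checked. It assumes that the digest is computed over the password followed by the salt; this is an assumption about the scheme, not a statement about what a particular notebook version does. The helper names and the salt value are made up.

```python
import hashlib
import hmac


def hash_password(password, salt, algorithm="sha1"):
    """Build an 'algorithm:salt:digest' string (sketch of the assumed scheme)."""
    h = hashlib.new(algorithm)
    h.update(password.encode("utf-8") + salt.encode("ascii"))
    return ":".join((algorithm, salt, h.hexdigest()))


def check_password(hashed, password):
    """Recompute the digest with the stored salt and compare the results."""
    algorithm, salt, _ = hashed.split(":")
    candidate = hash_password(password, salt, algorithm)
    return hmac.compare_digest(candidate, hashed)


stored = hash_password("jupyter", salt="0123456789ab")  # made-up salt
print(check_password(stored, "jupyter"))
print(check_password(stored, "wrong"))
```

Because the salt is stored in the string itself, only the hash needs to be placed in the configuration file; the clear-text password never does.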


This hash code needs to be placed in the Jupyter Notebook configuration file as pre- 
sented in Example 2-3. The configuration file assumes that the RSA key files have 
been copied on the Droplet to the /root/.jupyter/ folder. 


Example 2-3. Jupyter Notebook configuration file 


#
# Jupyter Notebook Configuration File
#
# Python for Finance, 2nd ed.
# (c) Dr. Yves J. Hilpisch
#

# SSL ENCRYPTION
# replace the following filenames (and files used) with your choice/files
c.NotebookApp.certfile = u'/root/.jupyter/cert.pem'
c.NotebookApp.keyfile = u'/root/.jupyter/cert.key'

# IP ADDRESS AND PORT
# set ip to '*' to bind on all IP addresses of the cloud instance
c.NotebookApp.ip = '*'
# it is a good idea to set a known, fixed default port for server access
c.NotebookApp.port = 8888


# PASSWORD PROTECTION 

# here: 'jupyter' as password 

# replace the hash code with the one for your strong password 

c.NotebookApp.password = 'sha1:d4d34232ac3a:55ea0ffd78cc3299e3e5e6ecc0d36be0935d424b'


# NO BROWSER OPTION 
# prevent Jupyter from trying to open a browser 
c.NotebookApp.open_browser = False 


Jupyter and Security 


Deploying Jupyter Notebook in the cloud inherently raises a
number of security issues since it is a full-fledged development
environment accessible via a web browser. It is therefore of para- 
mount importance to use the security measures that a Jupyter 
Notebook server provides by default, like password protection and 
SSL encryption. But this is just the beginning; further security 
measures might be advisable depending on what exactly is done on 
the cloud instance. 


The next step is to make sure that Python and Jupyter Notebook get installed on the 
Droplet. 


Installation Script for Python and Jupyter Notebook 


The bash script to install Python and Jupyter Notebook is similar to the one presen- 
ted in “Using Docker Containers” on page 45 to install Python via Miniconda in a 
Docker container. However, the script in Example 2-4 needs to start the Jupyter 
Notebook server as well. All major parts and lines of code are commented inline. 


Example 2-4. Bash script to install Python and to run the Jupyter Notebook server 
#!/bin/bash
#
# Script to Install
# Linux System Tools,
# Basic Python Packages and
# Jupyter Notebook Server
#
# Python for Finance, 2nd ed.
# (c) Dr. Yves J. Hilpisch
#

# GENERAL LINUX
apt-get update # updates the package index cache
apt-get upgrade -y # updates packages
apt-get install -y bzip2 gcc git htop screen vim wget # installs system tools
apt-get upgrade -y bash # upgrades bash if necessary
apt-get clean # cleans up the package index cache


# INSTALLING MINICONDA 

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O \
Miniconda.sh

bash Miniconda.sh -b # installs Miniconda 

rm Miniconda.sh # removes the installer 

# prepends the new path for current session 

export PATH="/root/miniconda3/bin:$PATH" 

# prepends the new path in the shell configuration 

echo ". /root/miniconda3/etc/profile.d/conda.sh" >> ~/.bashrc

echo "conda activate" >> ~/.bashrc


# INSTALLING PYTHON LIBRARIES 

# More packages can/must be added 

# depending on the use case. 

conda update -y conda # updates conda if required 

conda create -y -n py4fi python=3.7 # creates an environment 
source activate py4fi # activates the new environment 

conda install -y jupyter # interactive data analytics in the browser 
conda install -y pytables # wrapper for HDF5 binary storage 
conda install -y pandas # data analysis package 

conda install -y matplotlib # standard plotting library 
conda install -y scikit-learn # machine learning library 
conda install -y openpyxl # library for Excel interaction 
conda install -y pyyaml # library to manage YAML files 


pip install --upgrade pip # upgrades the package manager 
pip install cufflinks # combining plotly with pandas 


# COPYING FILES AND CREATING DIRECTORIES 

mkdir /root/.jupyter 

mv /root/jupyter_notebook_config.py /root/.jupyter/ 
mv /root/cert.* /root/.jupyter 

mkdir /root/notebook 

cd /root/notebook 


# STARTING JUPYTER NOTEBOOK 
jupyter notebook --allow-root 


# STARTING JUPYTER NOTEBOOK 
# as background process: 
# jupyter notebook --allow-root & 


This script needs to be copied to the Droplet and needs to be started by the orchestra- 
tion script as described in the next subsection. 
Script to Orchestrate the Droplet Setup


The second bash script, which sets up the Droplet, is the shortest one (Example 2-5). 
It mainly copies all the other files to the Droplet, whose IP address is expected as a 
parameter. In the final line it starts the install.sh bash script, which in turn does the 
installation itself and starts the Jupyter Notebook server. 


Example 2-5. Bash script to set up the Droplet 


#!/bin/bash 

# 

# Setting up a DigitalOcean Droplet 
# with Basic Python Stack 

# and Jupyter Notebook 

# 

# Python for Finance, 2nd ed. 

# (c) Dr Yves J Hilpisch 

# 


# IP ADDRESS FROM PARAMETER 
MASTER_IP=$1 


# COPYING THE FILES 
scp install.sh root@${MASTER_IP}:
scp cert.* jupyter_notebook_config.py root@${MASTER_IP}:


# EXECUTING THE INSTALLATION SCRIPT
ssh root@${MASTER_IP} bash /root/install.sh
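

For readers who prefer to drive the setup from Python, the same three steps can be sketched as follows. This is a dry run that only prints the commands; the function name is made up, and the certificate files are listed explicitly because no shell is involved to expand cert.*. Passing each list to subprocess.run() with check=True would execute the commands, assuming scp and ssh are available locally.

```python
import shlex


def setup_commands(master_ip):
    """Return the commands of the setup script as argument lists."""
    target = f"root@{master_ip}"
    return [
        # copy the installation script
        ["scp", "install.sh", f"{target}:"],
        # copy the certificate files and the Jupyter configuration file
        ["scp", "cert.key", "cert.pem", "jupyter_notebook_config.py", f"{target}:"],
        # run the installation script on the Droplet
        ["ssh", target, "bash", "/root/install.sh"],
    ]


# dry run with a placeholder address from the documentation range
for cmd in setup_commands("203.0.113.10"):
    print(shlex.join(cmd))
```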


Everything is now in place to give the setup code a try. On DigitalOcean, create a new
Droplet with options similar to these:


Operating system
Ubuntu 18.10 x64 (the newest version available at the time of this writing)


Size
1 core, 1 GB, 25 GB SSD (the smallest Droplet)


Data center region
Frankfurt (since your author lives in Germany)


SSH key
Add a (new) SSH key for password-less login¹¹


Droplet name
You can go with the prespecified name or can choose something like py4fi


11 If you need assistance, visit either “How to Add SSH Keys to Droplets” or “How to Create SSH Keys with
PuTTY on Windows”.


Clicking the Create button initiates the Droplet creation process, which generally 
takes about one minute. The major outcome of the setup procedure is the IP address, 
which might be, for instance, 46.101.156.199 if you chose Frankfurt as your data cen- 
ter location. Setting up the Droplet now is as easy as follows: 


(py3) ~/cloud$ bash setup.sh 46.101.156.199 


The resulting process might take a couple of minutes. It is finished when there is a 
message from the Jupyter Notebook server saying something like: 


The Jupyter Notebook is running at: https://[all ip addresses on your
system]:8888/


In any current browser, visiting the following address accesses the running Jupyter
Notebook server (note the https protocol):


https://46.101.156.199:8888


After perhaps requesting that you add a security exception, the Jupyter Notebook 
login screen prompting for a password (in our case, jupyter) should appear. You are 
now ready to start Python development in the browser via Jupyter Notebook, IPy- 
thon via a terminal window, or the text file editor. Other file management capabili- 
ties, such as file upload, deletion of files, and creation of folders, are also available. 


Benefits of the Cloud 


Cloud instances like those from DigitalOcean and Jupyter Note- 
book are a powerful combination, allowing the Python developer 
and quant to work on and make use of professional compute and 
storage infrastructure. Professional cloud and data center providers 
make sure that your (virtual) machines are physically secure and 
highly available. Using cloud instances also keeps the cost of the 
exploration and development phase rather low, since usage gener- 
ally gets charged by the hour without the need to enter into a long- 
term agreement. 


Conclusion 


Python is the programming language and technology platform of choice, not only for 
this book but for almost every leading financial institution. However, Python deploy- 
ment can be tricky at best and sometimes even tedious and nerve-wracking. Fortu- 
nately, several technologies that help with the deployment issue have become 
available in recent years. The open source conda helps with both Python package and 
virtual environment management. Docker containers go even further, in that com-
plete filesystems and runtime environments can be easily created in a technically
shielded “sandbox” (i.e., the container). Going even one step further, cloud providers
like DigitalOcean offer compute and storage capacity in professionally managed and 
secured data centers within minutes, billed by the hour. This in combination with a 
Python 3.7 installation and a secure Jupyter Notebook server installation provides a 
professional environment for Python development and deployment in the context of 
Python-for-finance projects. 


Further Resources 


For Python package management, consult the following resources:


• pip package manager page
• conda package manager page
• Installing Packages page


For virtual environment management, consult these resources:


• virtualenv environment manager page
• conda Managing Environments page
• pipenv package and environment manager


The following resources (among others) provide information about Docker
containers:


• Docker home page
• Matthias, Karl, and Sean Kane (2015). Docker: Up and Running. Sebastopol, CA:
  O'Reilly.


For a concise introduction to and overview of the bash scripting language, see:


• Robbins, Arnold (2016). Bash Pocket Reference. Sebastopol, CA: O'Reilly.


How to run a public Jupyter Notebook server securely is explained in the Jupyter Note- 
book documentation. There is also a hub available that allows the management of 
multiple users for a Jupyter Notebook server, called JupyterHub. 


To sign up on DigitalOcean with a 10 USD starting balance in your new account, visit 
the page http://bit.ly/do_sign_up. This pays for two months of usage of the smallest 
Droplet. 
PART II
Mastering the Basics 


This part of the book is concerned with the basics of Python programming. The top- 
ics covered in this part are fundamental for all other chapters to follow in subsequent 
parts and for Python usage in general. 


The chapters are organized according to certain topics such that they can be used as a 
reference to which the reader can come to look up examples and details related to the 
topic of interest: 

• Chapter 3 focuses on Python data types and structures.
• Chapter 4 is about NumPy and its ndarray class.
• Chapter 5 is about pandas and its DataFrame class.
• Chapter 6 discusses object-oriented programming (OOP) with Python.


CHAPTER 3 
Data Types and Structures 


Bad programmers worry about the code. Good programmers worry about data struc- 
tures and their relationships. 


—Linus Torvalds 


This chapter introduces the basic data types and data structures of Python, and is 
organized as follows: 


“Basic Data Types” on page 62 
The first section introduces basic data types such as int, float, bool, and str. 


“Basic Data Structures” on page 75 
The second section introduces the fundamental data structures of Python (e.g., 
list objects) and illustrates, among other things, control structures, functional 
programming approaches, and anonymous functions. 


The aim of this chapter is to provide a general introduction to Python specifics when 
it comes to data types and structures. The reader equipped with a background from 
another programming language, say C or Matlab, should be able to easily grasp the dif-
ferences that Python usage might bring along. The topics and idioms introduced here 
are important and fundamental for the chapters to come. 


The chapter covers the following data types and structures: 


Object type  Meaning                Used for
int          Integer value          Natural numbers
float        Floating-point number  Real numbers
bool         Boolean value          Something true or false
str          String object          Character, word, text
tuple        Immutable container    Fixed set of objects, record
list         Mutable container      Changing set of objects
dict         Mutable container      Key-value store
set          Mutable container      Collection of unique objects


Basic Data Types 


Python is a dynamically typed language, which means that the Python interpreter 
infers the type of an object at runtime. In comparison, compiled languages like C are 
generally statically typed. In these cases, the type of an object has to be specified for 
the object before compile time.¹
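

A short illustration of what dynamic typing means in practice: the same name can be bound to objects of different types over time, and the type is a property of the object, not of the name. The variable names are arbitrary.

```python
seen = []
for value in (10, 0.35, True, "ten"):
    x = value  # the same name is bound to objects of different types
    seen.append(type(x).__name__)

print(seen)
```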


Integers


One of the most fundamental data types is the integer, or int:


In [1]: a = 10
        type(a)

Out[1]: int


The built-in function type provides type information for all objects with standard
and built-in types as well as for newly created classes and objects. In the latter case, 
the information provided depends on the description the programmer has stored 
with the class. There is a saying that “everything in Python is an object.” This means, 
for example, that even simple objects like the int object just defined have built-in 
methods. For example, one can get the number of bits needed to represent the int 
object in memory by calling the method bit_length(): 

In [2]: a.bit_length()

Out[2]: 4


The number of bits needed increases with the size of the integer value assigned
to the object:

In [3]: a = 100000
        a.bit_length()

Out[3]: 17


In general, there are so many different methods that it is hard to memorize all meth-
ods of all classes and objects. Advanced Python environments like IPython provide
tab completion capabilities that show all the methods attached to an object. You sim- 
ply type the object name followed by a dot (e.g., a.) and then press the Tab key. This 


1 The Cython package brings static typing and compiling features to Python that are comparable to those in C. 
In fact, Cython is not only a package, it is a full-fledged hybrid programming language combining Python and 
C. 
then provides a collection of methods you can call on the object. Alternatively, the
Python built-in function dir gives a complete list of the attributes and methods of 
any object. 
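

As a brief illustration, dir can be used to list the public attributes and methods of an int object without tab completion. The exact list depends on the Python version in use, which is why no output is shown here. The second part relates bit_length() to the binary representation returned by the built-in function bin.

```python
a = 10

# names without a leading underscore are the "public" attributes and methods
public = [name for name in dir(a) if not name.startswith("_")]
print(public)

# bin() returns a string with the prefix '0b' followed by the binary digits
print(bin(a), len(bin(a)) - 2, a.bit_length())
```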


A specialty of Python is that integers can be arbitrarily large. Consider, for example,
the googol number $10^{100}$. Python has no problem with such large numbers:


In [4]: googol = 10 ** 100
        googol
Out[4]: 10000000000000000000000000000000000000000000000000000000000000000000000000 
000000000000000000000000000 


In [5]: googol.bit_length() 
Out[5]: 333 


Large Integers 


Python integers can be arbitrarily large. The interpreter simply 
uses as many bits/bytes as needed to represent the numbers. 
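

A small sketch of what arbitrary precision buys in practice: integer arithmetic stays exact at any magnitude, whereas a float of the same size can no longer resolve a difference of 1. The variable names are chosen for illustration.

```python
googol = 10 ** 100
bits = googol.bit_length()

# bit_length() is the smallest n for which googol < 2 ** n holds
print(bits, 2 ** (bits - 1) <= googol < 2 ** bits)

# exact for integers of any size
print((googol + 1) - googol)

# a float of this magnitude cannot represent the added 1
print(float(googol) + 1 == float(googol))
```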


Arithmetic operations on integers are also easy to implement: 


In [6]: 1 + 4
Out[6]: 5


In [7]: 1 / 4
Out[7]: 0.25


In [8]: type(1 / 4) 
Out[8]: float 


Floats 


The last expression returns the mathematically correct result of 0.25,² which gives rise
to the next basic data type, the float. Adding a dot to an integer value, like in 1. or
1.0, causes Python to interpret the object as a float. Expressions involving a float
also return a float object in general:³


In [9]: 1.6 / 4
Out[9]: 0.4


In [10]: type(1.6 / 4)
Out[10]: float


2 This is different in Python 2.x, where floor division is the default. Floor division in Python 3.x is accom-
plished by 3 // 4, which gives 0 as the result.


3 Here and in the following discussion, terms like float, float object, etc. are used interchangeably, acknowl-
edging that every float is also an object. The same holds true for other object types.


A float is a bit more involved in that the computerized representation of rational or
real numbers is in general not exact and depends on the specific technical approach 
taken. To illustrate what this implies, let us define another float object, b. float 
objects like this one are always represented internally up to a certain degree of accu- 
racy only. This becomes evident when adding 0.1 to b: 


In [11]: b = 0.35
         type(b)
Out[11]: float


In [12]: b + 0.1 
Out[12]: 0.44999999999999996 


The reason for this is that float objects are internally represented in binary format; 
that is, a decimal number 0 < n < 1 is represented by a series of the form
$n = \frac{x}{2} + \frac{y}{4} + \frac{z}{8} + \ldots$. For certain floating-point numbers the binary representation
might involve a large number of elements or might even be an infinite series. How- 
ever, given a fixed number of bits used to represent such a number—i.e., a fixed num- 
ber of terms in the representation series—inaccuracies are the consequence. Other 
numbers can be represented perfectly and are therefore stored exactly even with a 
finite number of bits available. Consider the following example: 


In [13]: c = 0.5
         c.as_integer_ratio()
Out[13]: (1, 2)


One-half, i.e., 0.5, is stored exactly because it has an exact (finite) binary
representation as $0.5 = \frac{1}{2}$. However, for b = 0.35 one gets something different
than the expected rational number $0.35 = \frac{7}{20}$:


In [14]: b.as_integer_ratio()
Out[14]: (3152519739159347, 9007199254740992)


The precision is dependent on the number of bits used to represent the number. In
general, all platforms that Python runs on use the IEEE 754 double-precision standard
(i.e., 64 bits) for internal representation. This translates into a 15-digit relative
accuracy.
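
These limits can be inspected from within Python itself. The following is a minimal sketch, assuming a standard CPython build on which float is an IEEE 754 double; the variable names are chosen for illustration only:

```python
import sys
from fractions import Fraction

# Properties of the platform's float type (IEEE 754 double assumed).
mantissa_bits = sys.float_info.mant_dig  # bits in the significand
safe_digits = sys.float_info.dig         # decimal digits that survive a round trip

b = 0.35
# Fraction(b) recovers the exact binary value that is stored for b ...
stored = Fraction(b)
# ... which differs slightly from the intended rational number 7/20.
intended = Fraction(7, 20)
error = abs(stored - intended)

print(mantissa_bits, safe_digits)
print(stored == Fraction(*b.as_integer_ratio()))
print(float(error))
```

The representation error is tiny for a single number, but it is not zero, which is what the output of b + 0.1 reflects.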


Since this topic is of high importance for several application areas in finance, it is 
sometimes necessary to ensure the exact, or at least best possible, representation of 
numbers. For example, the issue can be of importance when summing over a large set 
of numbers. In such a situation, a certain kind and/or magnitude of representation 
error might, in aggregate, lead to significant deviations from a benchmark value. 
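
The accumulation effect can be made visible with a very small example. The sketch below, under the assumption of IEEE 754 doubles, adds one-tenth ten times in an explicit loop and compares the result with math.fsum(), which tracks the intermediate rounding errors:

```python
import math

values = [0.1] * 10  # ten times one-tenth; the exact sum would be 1.0

# Plain left-to-right accumulation; every addition is rounded.
total = 0.0
for v in values:
    total += v

# fsum() compensates for the rounding errors of the partial sums.
compensated = math.fsum(values)

print(total == 1.0)
print(compensated == 1.0)
print(math.isclose(total, 1.0))  # tolerance-based comparison
```

When exact equality is not required, a tolerance-based comparison such as math.isclose() is usually the more robust check.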

The module decimal provides an arbitrary-precision object for floating-point numbers
and several options to address precision issues when working with such numbers:


In [15]: import decimal 
from decimal import Decimal 


In [16]: decimal.getcontext()
Out[16]: Context(prec=28, rounding=ROUND_HALF_EVEN, Emin=-999999, Emax=999999,
         capitals=1, clamp=0, flags=[], traps=[InvalidOperation, DivisionByZero,
         Overflow])


In [17]: d = Decimal(1) / Decimal(11)
         d
Out[17]: Decimal('0.09090909090909090909090909091')


One can change the precision of the representation by changing the respective 
attribute value of the Context object: 


In [18]: decimal.getcontext().prec = 4  # (1)


In [19]: e = Decimal(1) / Decimal(11)
         e
Out[19]: Decimal('0.09091')


In [20]: decimal.getcontext().prec = 50  # (2)


In [21]: f = Decimal(1) / Decimal(11)
         f
Out[21]: Decimal('0.090909090909090909090909090909090909090909090909091')


(1) Lower precision than default.

(2) Higher precision than default.


If needed, the precision can in this way be adjusted to the exact problem at hand and 
one can operate with floating-point objects that exhibit different degrees of accuracy: 


In [22]: g = d + e + f
         g
Out[22]: Decimal('0.27272818181818181818181818181909090909090909090909')


Arbitrary-Precision Floats 


The module decimal provides an arbitrary-precision floating-point 
number object. In finance, it might sometimes be necessary to 
ensure high precision and to go beyond the 64-bit double-precision 
standard. 
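
Two practical details are worth a short illustration. The sketch below assumes nothing beyond the standard library: it shows that a Decimal object should be constructed from a str rather than from a float, and that localcontext() can be used to change the precision temporarily instead of modifying the global context:

```python
from decimal import Decimal, localcontext

# Constructing from a str keeps the decimal digits exactly as written.
exact = Decimal('0.35') + Decimal('0.1')

# Constructing from a float copies the binary approximation instead.
from_float = Decimal(0.35)

# localcontext() changes the precision only inside the with block.
with localcontext() as ctx:
    ctx.prec = 10
    short = Decimal(1) / Decimal(11)

print(exact)
print(from_float == Decimal('0.35'))
print(short)
```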


Booleans


In programming, evaluating a comparison or logical expression (such as 4 > 3, 4.5 
<= 3.25 or (4 > 3) and (3 > 2)) yields one of True or False as output, two impor- 
tant Python keywords. Others are, for example, def, for, and if. A complete list of 
Python keywords is available in the keyword module: 


In [23]: import keyword 


In [24]: keyword.kwlist 
Out[24]: ['False',
          'None',
          'True',
          'and',
          'as',
          'assert',
          'async',
          'await',
          'break',
          'class',
          'continue',
          'def',
          'del',
          'elif',
          'else',
          'except',
          'finally',
          'for',
          'from',
          'global',
          'if',
          'import',
          'in',
          'is',
          'lambda',
          'nonlocal',
          'not',
          'or',
          'pass',
          'raise',
          'return',
          'try',
          'while',
          'with',
          'yield']


True and False are of data type bool, standing for Boolean value. The following code
shows Python’s comparison operators applied to the same operands with the resulting
bool objects:


In [25]: 4 > 3  # (1)
Out[25]: True


In [26]: type(4 > 3)
Out[26]: bool


In [27]: type(False)
Out[27]: bool


In [28]: 4 >= 3  # (2)
Out[28]: True


In [29]: 4 < 3  # (3)
Out[29]: False


In [30]: 4 <= 3  # (4)
Out[30]: False


In [31]: 4 == 3  # (5)
Out[31]: False


In [32]: 4 != 3  # (6)
Out[32]: True


(1) Is greater.

(2) Is greater or equal.

(3) Is smaller.

(4) Is smaller or equal.

(5) Is equal.

(6) Is not equal.


Often, logical operators are applied on bool objects, which in turn yields another 
bool object: 


In [33]: True and True 
Out[33]: True 


In [34]: True and False 
Out[34]: False 


In [35]: False and False 
Out[35]: False 


In [36]: True or True 
Out[36]: True 


In [37]: True or False
Out[37]: True 


In [38]: False or False 
Out[38]: False 


In [39]: not True 
Out[39]: False 


In [40]: not False 
Out[40]: True 


Of course, both types of operators are often combined: 


In [41]: (4 > 3) and (2 > 3) 
Out[41]: False 


In [42]: (4 == 3) or (2 != 3) 
Out[42]: True 


In [43]: not (4 != 4) 
Out[43]: True 


In [44]: (not (4 != 4)) and (2 == 3) 
Out[44]: False 


One major application area is to control the code flow via other Python keywords, 
such as if or while (more examples later in the chapter): 


In [45]: if 4 > 3:  # (1)
             print('condition true')  # (2)
         condition true


In [46]: i = 0  # (3)
         while i < 4:  # (4)
             print('condition true, i = ', i)  # (5)
             i += 1  # (6)
         condition true, i =  0
         condition true, i =  1
         condition true, i =  2
         condition true, i =  3


(1) If condition holds true, execute code to follow.

(2) The code to be executed if condition holds true.

(3) Initializes the parameter i with 0.

(4) As long as the condition holds true, execute and repeat the code to follow.

(5) Prints a text and the value of parameter i.

(6) Increases the parameter value by 1; i += 1 is the same as i = i + 1.


Numerically, Python attaches a value of 0 to False and a value of 1 to True. When 
transforming numbers to bool objects via the bool() function, a 0 gives False while 
all other numbers give True: 


In [47]: int(True)
Out[47]: 1 


In [48]: int(False) 
Out[48]: 0 


In [49]: float(True) 
Out[49]: 1.0 


In [50]: float(False) 
Out[50]: 0.0 


In [51]: bool(0) 
Out[51]: False 


In [52]: bool(0.0) 
Out[52]: False 


In [53]: bool(1) 
Out[53]: True 


In [54]: bool(10.5) 
Out[54]: True 


In [55]: bool(-2) 
Out[55]: True 
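
Because True and False behave like 1 and 0 in arithmetic, conditions can be counted by simply summing them. A minimal sketch with made-up numbers (the returns list is hypothetical and serves illustration only):

```python
# Hypothetical daily returns, made up for illustration.
returns = [0.012, -0.004, 0.0, 0.021, -0.015, 0.003]

# Each comparison yields a bool; sum() adds them as 1 (True) or 0 (False).
up_days = sum(r > 0 for r in returns)
down_days = sum(r < 0 for r in returns)
share_up = up_days / len(returns)

print(up_days, down_days, share_up)
```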


Strings 


Now that natural and floating-point numbers can be represented, this subsection 
turns to text. The basic data type to represent text in Python is str. The str object 
has a number of helpful built-in methods. In fact, Python is generally considered to 
be a good choice when it comes to working with texts and text files of any kind and 
any size. A str object is generally defined by single or double quotation marks or by
converting another object using the str() function (i.e., using the object’s standard
or user-defined str representation):


In [56]: t = 'this is a string object' 


With regard to the built-in methods, you can, for example, capitalize the first word in 
this object: 


In [57]: t.capitalize() 
Out[57]: 'This is a string object' 


Or you can split it into its single-word components to get a list object of all the
words (more on list objects later):


In [58]: t.split()
Out[58]: ['this', 'is', 'a', 'string', 'object']


You can also search for a word and get the position (i.e., index value) of the first letter
of the word back in a successful case:


In [59]: t.find('string') 
Out[59]: 10 


If the word is not in the str object, the method returns -1:


In [60]: t.find('Python') 
Out[60]: -1 


Replacing characters in a string is a typical task that is easily accomplished with the
replace() method:


In [61]: t.replace(' ', '|')
Out[61]: 'this|is|a|string|object'


The stripping of strings (i.e., deletion of certain leading/trailing characters) is also
often necessary. Note that the argument is treated as a set of characters to remove, not
as a prefix or suffix:


In [62]: 'http://www.python.org'.strip('htp:/') 
Out[62]: 'www.python.org' 
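
The set-of-characters behavior of strip() can lead to surprises. The sketch below uses a slightly different, purely illustrative URL to show the effect and contrasts it with urlsplit() from the standard library, which parses the URL instead of removing characters:

```python
from urllib.parse import urlsplit

url = 'http://python.org'

# strip() removes any of the given characters from both ends, so the
# leading 'p' of 'python' is removed along with the scheme.
stripped = url.strip('htp:/')

# urlsplit() parses the URL and returns the host part unchanged.
host = urlsplit(url).netloc

print(stripped)
print(host)
```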


Table 3-1 lists a number of helpful methods of the str object. 


Table 3-1. Selected string methods 


Method       Arguments                Returns/result
capitalize   ()                       Copy of the string with first letter capitalized
count        (sub[, start[, end]])    Count of the number of occurrences of substring
encode       ([encoding[, errors]])   Encoded version of the string
find         (sub[, start[, end]])    (Lowest) index where substring is found
join         (seq)                    Concatenation of strings in sequence seq
replace      (old, new[, count])      Replaces old by new the first count times
split        ([sep[, maxsplit]])      List of words in string with sep as separator
splitlines   ([keepends])             Separated lines with line ends/breaks if keepends is True
strip        (chars)                  Copy of string with leading/trailing characters in chars removed
upper        ()                       Copy with all letters capitalized
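
A few of the methods from the table, applied to the same example string as before, might look as follows (a minimal sketch; the variable names are illustrative):

```python
t = 'this is a string object'

words = t.split()                     # list of the five words
joined = '-'.join(words)              # join() is called on the separator
n_is = t.count('is')                  # non-overlapping occurrences of 'is'
shout = t.upper()                     # all letters capitalized
lines = 'first\nsecond'.splitlines()  # split at line breaks

print(joined)
print(n_is)
print(shout)
print(lines)
```

Note that count() also finds the substring inside other words, here the 'is' in 'this'.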


Unicode Strings


A fundamental change from Python 2.7 (used for the first edition 
of the book) to Python 3.7 (used for this second edition) is the 
encoding and decoding of string objects and the introduction of 
Unicode. This chapter does not go into the many details important 
in this context; for the purposes of this book, which mainly deals 
with numerical data and standard strings containing English 
words, this omission seems justified. 


Excursion: Printing and String Replacements 


Printing str objects or string representations of other Python objects is usually 
accomplished by the print() function: 


In [63]: print('Python for Finance')  # (1)
         Python for Finance


In [64]: print(t)  # (2)
         this is a string object


In [65]: i = 0
         while i < 4:
             print(i)  # (3)
             i += 1
         0
         1
         2
         3


In [66]: i = 0
         while i < 4:
             print(i, end='|')  # (4)
             i += 1
         0|1|2|3|


(1) Prints a str object.

(2) Prints a str object referenced by a variable name.

(3) Prints the string representation of an int object.

(4) Specifies the final character(s) when printing; default is a line break (\n) as seen
    before.


Python offers powerful string replacement operations. There is the old way, via the % 
character, and the new way, via curly braces ({}) and format(). Both are still applied 
in practice. This section cannot provide an exhaustive illustration of all options, but 
the following code snippets show some important ones. First, the old way of doing it: 


In [67]: 'this is an integer %d' % 15  # (1)
Out[67]: 'this is an integer 15'


In [68]: 'this is an integer %4d' % 15  # (2)
Out[68]: 'this is an integer   15'


In [69]: 'this is an integer %04d' % 15  # (3)
Out[69]: 'this is an integer 0015'


In [70]: 'this is a float %f' % 15.3456  # (4)
Out[70]: 'this is a float 15.345600'


In [71]: 'this is a float %.2f' % 15.3456  # (5)
Out[71]: 'this is a float 15.35'


In [72]: 'this is a float %8f' % 15.3456  # (6)
Out[72]: 'this is a float 15.345600'


In [73]: 'this is a float %8.2f' % 15.3456  # (7)
Out[73]: 'this is a float    15.35'


In [74]: 'this is a float %08.2f' % 15.3456  # (8)
Out[74]: 'this is a float 00015.35'


In [75]: 'this is a string %s' % 'Python'  # (9)
Out[75]: 'this is a string Python'


In [76]: 'this is a string %10s' % 'Python'  # (10)
Out[76]: 'this is a string     Python'


(1) int object replacement.

(2) With fixed number of characters.

(3) With leading zeros if necessary.

(4) float object replacement.

(5) With fixed number of decimals.

(6) With fixed number of characters (and filled-up decimals).

(7) With fixed number of characters and decimals ...

(8) ... and leading zeros if necessary.

(9) str object replacement.

(10) With fixed number of characters.




Now, here are the same examples implemented in the new way. Notice the slight dif- 
ferences in the output in some places: 


In [77]: 'this is an integer {:d}'.format(15)
Out[77]: 'this is an integer 15'


In [78]: 'this is an integer {:4d}'.format(15)
Out[78]: 'this is an integer   15'


In [79]: 'this is an integer {:04d}'.format(15)
Out[79]: 'this is an integer 0015'


In [80]: 'this is a float {:f}'.format(15.3456)
Out[80]: 'this is a float 15.345600'


In [81]: 'this is a float {:.2f}'.format(15.3456)
Out[81]: 'this is a float 15.35'


In [82]: 'this is a float {:8f}'.format(15.3456)
Out[82]: 'this is a float 15.345600'


In [83]: 'this is a float {:8.2f}'.format(15.3456)
Out[83]: 'this is a float    15.35'


In [84]: 'this is a float {:08.2f}'.format(15.3456)
Out[84]: 'this is a float 00015.35'


In [85]: 'this is a string {:s}'.format('Python')
Out[85]: 'this is a string Python'


In [86]: 'this is a string {:10s}'.format('Python')
Out[86]: 'this is a string Python    '
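
Since Python 3.6 there is a third option, formatted string literals (f-strings), which embed the expression directly in the string. The sketch below is an addition for comparison and assumes Python 3.6 or later; the format specification after the colon follows the same rules as for format():

```python
value = 15.3456
name = 'Python'

# The part after the colon uses the same mini-language as format().
as_int = f'this is an integer {15:04d}'
as_float = f'this is a float {value:8.2f}'
as_string = f'this is a string {name:10s}'

print(as_int)
print(as_float)
print(as_string)
```

As with format(), numbers are right-aligned and strings left-aligned within the given width by default.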


String replacements are particularly useful in the context of multiple printing opera- 
tions where the printed data is updated, for instance, during a while loop: 


In [87]: i = 0
         while i < 4:
             print('the number is %d' % i)
             i += 1
         the number is 0
         the number is 1
         the number is 2
         the number is 3


In [88]: i = 0
         while i < 4:
             print('the number is {:d}'.format(i))
             i += 1
         the number is 0
         the number is 1
         the number is 2
         the number is 3


Excursion: Regular Expressions


A powerful tool when working with str objects is regular expressions. Python pro- 
vides such functionality in the module re: 


In [89]: import re 


Suppose a financial analyst is faced with a large text file, such as a CSV file, which 
contains certain time series and respective date-time information. More often than 
not, this information is delivered in a format that Python cannot interpret directly. 
However, the date-time information can generally be described by a regular expres- 
sion. Consider the following str object, containing three date-time elements, three 
integers, and three strings. Note that triple quotation marks allow the definition of 
str objects over multiple rows: 


In [90]: series = """
         '01/18/2014 13:00:00', 100, '1st';
         '01/18/2014 13:30:00', 110, '2nd';
         '01/18/2014 14:00:00', 120, '3rd'
         """


The following regular expression describes the format of the date-time information 
provided in the str object: 


In [91]: dt = re.compile(r"'[0-9/:\s]+'")  # datetime; raw string keeps \s intact


Equipped with this regular expression, one can go on and find all the date-time ele- 
ments. In general, applying regular expressions to str objects also leads to perfor- 
mance improvements for typical parsing tasks: 


In [92]: result = dt.findall(series) 
result 

Out[92]: ["'01/18/2014 13:00:00'", "'01/18/2014 13:30:00'", "'01/18/2014 
14:00:00'"] 


Regular Expressions 


When parsing str objects, consider using regular expressions, 
which can bring both convenience and performance to such 
operations. 


4 It is not possible to go into detail here, but there is a wealth of information available on the internet about
regular expressions in general and for Python in particular. For an introduction to this topic, refer to
Fitzgerald (2012).


The resulting str objects can then be parsed to generate Python datetime objects
(see Appendix A for an overview of handling date and time data with Python). To
parse the str objects containing the date-time information, one needs to provide
information of how to parse them, again as a str object:


In [93]: from datetime import datetime
         pydt = datetime.strptime(result[0].replace("'", ""),
                                  '%m/%d/%Y %H:%M:%S')
         pydt
Out[93]: datetime.datetime(2014, 1, 18, 13, 0)


In [94]: print(pydt) 
2014-01-18 13:00:00 


In [95]: print(type(pydt)) 
<class 'datetime.datetime'> 
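
The two steps, finding the date-time elements and parsing them, can be combined for all matches at once. The following self-contained sketch repeats the example data from above; the name stamps is an illustrative choice:

```python
import re
from datetime import datetime

series = """
'01/18/2014 13:00:00', 100, '1st';
'01/18/2014 13:30:00', 110, '2nd';
'01/18/2014 14:00:00', 120, '3rd'
"""

# Raw string, so that \s reaches the regular expression engine unchanged.
dt = re.compile(r"'[0-9/:\s]+'")

# Remove the quotation marks and parse each match with the same format string.
stamps = [datetime.strptime(m.strip("'"), '%m/%d/%Y %H:%M:%S')
          for m in dt.findall(series)]

print(stamps[0], stamps[-1])
print(stamps[1] - stamps[0])  # difference of two datetime objects
```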


Later chapters provide more information on date-time data, the handling of such 
data, and datetime objects and their methods. This is just meant to be a teaser for 
this important topic in finance. 


Basic Data Structures 


As a general rule, data structures are objects that contain a possibly large number of 
other objects. Among those that Python provides as built-in structures are: 


tuple
    An immutable collection of arbitrary objects; only a few methods available

list
    A mutable collection of arbitrary objects; many methods available

dict
    A key-value store object

set
    An unordered collection object for other unique objects


Tuples 


A tuple is an advanced data structure, yet it’s still quite simple and limited in its 
applications. It is defined by providing objects in parentheses: 


In [96]: t = (1, 2.5, 'data')
         type(t)
Out[96]: tuple


You can even drop the parentheses and provide multiple objects, just separated by
commas:


In [97]: t = 1, 2.5, 'data'
         type(t)
Out[97]: tuple


Like almost all data structures in Python the tuple has a built-in index, with the help
of which you can retrieve single or multiple elements of the tuple. It is important to
remember that Python uses zero-based numbering, such that the third element of a
tuple is at index position 2:


In [98]: t[2] 
Out[98]: 'data' 


In [99]: type(t[2]) 
Out[99]: str 


Zero-Based Numbering 


In contrast to some other programming languages like Matlab, 
Python uses zero-based numbering schemes. For example, the first 
element of a tuple object has index value 0. 


There are only two special methods that this object type provides: count() and 
index(). The first counts the number of occurrences of a certain object and the sec- 
ond gives the index value of the first appearance of it: 


In [100]: t.count('data') 
Out[100]: 1 


In [101]: t.index(1)
Out[101]: 0


tuple objects are immutable. This means that, once defined, they cannot be
changed; any modification requires creating a new tuple object.
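
What immutability means in practice can be sketched as follows: item assignment raises a TypeError, and a changed version has to be built as a new object (the name t_new is illustrative):

```python
t = (1, 2.5, 'data')

try:
    t[0] = 100  # item assignment is not supported by tuple objects
except TypeError as err:
    message = str(err)

# A changed version requires building a new tuple object.
t_new = (100,) + t[1:]

print(message)
print(t, t_new)
```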


Lists 


Objects of type list are much more flexible and powerful in comparison to tuple 
objects. From a finance point of view, you can achieve a lot working only with list 
objects, such as storing stock price quotes and appending new data. A list object is 
defined through brackets and the basic capabilities and behaviors are similar to those 
of tuple objects: 

In [102]: l = [1, 2.5, 'data']
          l[2]
Out[102]: 'data'


list objects can also be defined or converted by using the function list(). The following
code generates a new list object by converting the tuple object from the
previous example:


In [103]: l = list(t)
          l
Out[103]: [1, 2.5, 'data']


In [104]: type(l) 
Out[104]: list 


In addition to the characteristics of tuple objects, list objects are also expandable 
and reducible via different methods. In other words, whereas str and tuple objects 
are immutable sequence objects (with indexes) that cannot be changed once created, 
list objects are mutable and can be changed via different operations. You can 
append list objects to an existing list object, and more: 


In [105]: l.append([4, 3])  # (1)
          l
Out[105]: [1, 2.5, 'data', [4, 3]]


In [106]: l.extend([1.0, 1.5, 2.0])  # (2)
          l
Out[106]: [1, 2.5, 'data', [4, 3], 1.0, 1.5, 2.0]


In [107]: l.insert(1, 'insert')  # (3)
          l
Out[107]: [1, 'insert', 2.5, 'data', [4, 3], 1.0, 1.5, 2.0]


In [108]: l.remove('data')  # (4)
          l
Out[108]: [1, 'insert', 2.5, [4, 3], 1.0, 1.5, 2.0]


In [109]: p = l.pop(3)  # (5)
          print(l, p)
          [1, 'insert', 2.5, 1.0, 1.5, 2.0] [4, 3]


(1) Append list object at the end.

(2) Append elements of the list object.

(3) Insert object before index position.

(4) Remove first occurrence of object.

(5) Remove and return object at index position.

Slicing is also easily accomplished. Here, slicing refers to an operation that breaks 
down a data set into smaller parts (of interest): 


In [110]: l[2:5]  # (1)
Out[110]: [2.5, 1.0, 1.5]


(1) Return the third through fifth elements.
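
Slices also accept omitted bounds, negative index values, and a step size. A brief sketch based on the list object in its state after the pop() operation:

```python
l = [1, 'insert', 2.5, 1.0, 1.5, 2.0]

first_two = l[:2]        # start defaults to 0
last_two = l[-2:]        # negative index values count from the end
every_second = l[::2]    # the third value is the step size
reversed_copy = l[::-1]  # a negative step walks backward

print(first_two, last_two, every_second)
print(reversed_copy)
```

A slice always returns a new list object; the original object is left unchanged.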




Table 3-2 provides a summary of selected operations and methods of the list object. 


Table 3-2. Selected operations and methods of list objects 


Method          Arguments                   Returns/result
l[i] = x        [i]                         Replaces i-th element by x
l[i:j:k] = s    [i:j:k]                     Replaces every k-th element from i to j - 1 by s
append          (x)                         Appends x to object
count           (x)                         Number of occurrences of object x
del l[i:j:k]    [i:j:k]                     Deletes elements with index values i to j - 1 and step size k
extend          (s)                         Appends all elements of s to object
index           (x[, i[, j]])               First index of x between elements i and j - 1
insert          (i, x)                      Inserts x at/before index i
remove          (x)                         Removes element x at first match
pop             (i)                         Removes element with index i and returns it
reverse         ()                          Reverses all items in place
sort            (key=None, reverse=False)   Sorts all items in place
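
The in-place character of sort() and the use of its keyword arguments can be sketched as follows; the quotes and names lists are hypothetical and serve illustration only:

```python
quotes = [101.5, 99.8, 100.2, 99.8]  # hypothetical price quotes

quotes.sort()                # ascending, in place; the method returns None
ascending = list(quotes)
quotes.sort(reverse=True)    # descending, again in place
descending = list(quotes)

names = ['msft', 'AAPL', 'goog']
names.sort(key=str.lower)    # key function is applied before comparing

first_pos = ascending.index(99.8)  # index of the first match only
n_dupes = ascending.count(99.8)    # number of occurrences

print(ascending, descending, names, first_pos, n_dupes)
```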


Excursion: Control Structures 


Although a topic in themselves, control structures like for loops are maybe best
introduced in Python based on list objects. This is due to the fact that looping in general
takes place over list objects, which is quite different to what is often the standard in
other languages. Take the following example. The for loop loops over the elements of
the list object l with index values 2 to 4 and prints the square of the respective
elements. Note the importance of the indentation (whitespace) in the second line:


In [111]: for element in l[2:5]:
              print(element ** 2)
          6.25
          1.0
          2.25


This provides a really high degree of flexibility in comparison to the typical counter- 
based looping. Counter-based looping is also an option with Python, but is accom- 
plished using the range object: 

In [112]: r = range(0, 8, 1)  # (1)
          r
Out[112]: range(0, 8)


In [113]: type(r)
Out[113]: range


(1) Parameters are start, end, and step size.


For comparison, the same loop is implemented using range() as follows:


In [114]: for i in range(2, 5):
              print(l[i] ** 2)
          6.25
          1.0
          2.25


Looping over Lists 


In Python you can loop over arbitrary list objects, no matter what 
the content of the object is. This often avoids the introduction of a 
counter. 
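
When the position of an element is needed in addition to the element itself, or when several list objects are to be traversed in parallel, the built-in functions enumerate() and zip() avoid a manual counter as well. A minimal sketch with hypothetical symbols and prices:

```python
symbols = ['AAPL', 'MSFT', 'GOOG']  # hypothetical ticker symbols
prices = [150.0, 250.0, 100.0]      # hypothetical prices

# enumerate() yields (position, element) pairs.
numbered = [(i, s) for i, s in enumerate(symbols)]

# zip() walks several sequences in parallel.
lines = []
for symbol, price in zip(symbols, prices):
    lines.append('{:s} {:8.2f}'.format(symbol, price))

print(numbered)
print('\n'.join(lines))
```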


Python also provides the typical (conditional) control elements if, elif, and else.
Their use is comparable to that in other languages:


In [115]: for i in range(1, 10):
              if i % 2 == 0:  # (1)
                  print("%d is even" % i)
              elif i % 3 == 0:
                  print("%d is multiple of 3" % i)
              else:
                  print("%d is odd" % i)
          1 is odd
          2 is even
          3 is multiple of 3
          4 is even
          5 is odd
          6 is even
          7 is odd
          8 is even
          9 is multiple of 3


(1) % stands for modulo.


Similarly, while provides another means to control the flow:


In [116]: total = 0
          while total < 100:
              total += 1
          print(total)
          100


A specialty of Python is so-called list comprehensions. Instead of looping over existing 
list objects, this approach generates list objects via loops in a rather compact fash- 
ion: 

In [117]: m = [i ** 2 for i in range(5)]
          m
Out[117]: [0, 1, 4, 9, 16]


In a certain sense, this already provides a first means to generate “something like”
vectorized code in that loops are implicit rather than explicit (vectorization of code is
discussed in more detail in Chapters 4 and 5).
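
A list comprehension can also carry a condition that filters the inputs. The sketch below compares such a comprehension with the equivalent explicit loop:

```python
# Squares of the even numbers below 10; the if clause filters the inputs.
even_squares = [i ** 2 for i in range(10) if i % 2 == 0]

# The equivalent explicit loop, for comparison.
loop_result = []
for i in range(10):
    if i % 2 == 0:
        loop_result.append(i ** 2)

print(even_squares)
print(even_squares == loop_result)
```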


Excursion: Functional Programming 


Python provides a number of tools for functional programming support as well—i.e., 
the application of a function to a whole set of inputs (in our case list objects). 
Among these tools are filter(), map(), and reduce(). However, one needs a func- 
tion definition first. To start with something really simple, consider a function f() 
that returns the square of the input x: 

In [118]: def f(x):
              return x ** 2
          f(2)
Out[118]: 4


Of course, functions can be arbitrarily complex, with multiple input/parameter
objects and even multiple outputs (return objects). However, consider the following
function:


In [119]: def even(x):
              return x % 2 == 0
          even(3)
Out[119]: False


The return object is a Boolean. Such a function can be applied to a whole list object
by using map():


In [120]: list(map(even, range(10)))
Out[120]: [True, False, True, False, True, False, True, False, True, False]


To this end, one can also provide a function definition directly as an argument to
map(), making use of lambda or anonymous functions:


In [121]: list(map(lambda x: x ** 2, range(10)))
Out[121]: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]


Functions can also be used to filter a list object. In the following example, the filter
returns elements of a list object that match the Boolean condition as defined by the
even function:


In [122]: list(filter(even, range(15))) 
Out[122]: [0, 2, 4, 6, 8, 10, 12, 14] 


List Comprehensions, Functional Programming, Anonymous Functions


It can be considered good practice to avoid loops on the Python
level as far as possible. List comprehensions and functional programming
tools like filter(), map(), and reduce() provide
means to write code without (explicit) loops that is both compact
and in general more readable. lambda or anonymous functions are
also powerful tools in this context.
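
Of the three tools mentioned, reduce() is the only one that is not a built-in function in Python 3; it is imported from the functools module. A minimal sketch, with made-up growth factors chosen so that the arithmetic is exact:

```python
from functools import reduce

# reduce() folds a sequence into a single value by applying a
# two-argument function cumulatively from left to right.
total = reduce(lambda x, y: x + y, range(10))

# Hypothetical growth factors compounded into one cumulative factor;
# the third argument (1.0) is the starting value.
factors = [1.5, 2.0, 0.5]
cumulative = reduce(lambda x, y: x * y, factors, 1.0)

print(total, cumulative)
```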


Dicts 


dict objects are dictionaries, i.e., mutable mappings that allow data retrieval by
keys (which can, for example, be str objects). They are so-called key-value stores.
While list objects are ordered by position and sortable in place, dict objects are
accessed by key rather than by position and are not sortable in place (since Python 3.7
they do preserve the insertion order of their keys).⁵ An example best illustrates further
differences to list objects. Curly braces are what define dict objects:


In [123]: d = {
              'Name' : 'Angela Merkel',
              'Country' : 'Germany',
              'Profession' : 'Chancelor',
              'Age' : 64
              }
          type(d)
Out[123]: dict


In [124]: print(d['Name'], d['Age']) 
Angela Merkel 64 


Again, this class of objects has a number of built-in methods: 


In [125]: d.keys() 

Out[125]: dict_keys(['Name', 'Country', 'Profession', 'Age']) 

In [126]: d.values() 

Out[126]: dict_values(['Angela Merkel', 'Germany', 'Chancelor', 64]) 

In [127]: d.items() 

Out[127]: dict_items([('Name', 'Angela Merkel'), ('Country', 'Germany'), 
('Profession', 'Chancelor'), ('Age', 64)]) 

In [128]: birthday = True
          if birthday:
              d['Age'] += 1
          print(d['Age'])
          65


5 There are variants to the standard dict object, including among others an OrderedDict subclass, which 
remembers the order in which entries are added. See https://docs.python.org/3/library/collections.html. 


There are several methods to get iterator objects from a dict object. The iterator
objects behave like list objects when iterated over:


In [129]: for item in d.items():
              print(item)
          ('Name', 'Angela Merkel')
          ('Country', 'Germany')
          ('Profession', 'Chancelor')
          ('Age', 65)


In [130]: for value in d.values():
              print(type(value))
          <class 'str'>
          <class 'str'>
          <class 'str'>
          <class 'int'>
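
A few further operations are common in day-to-day work with dict objects. The sketch below uses a shortened version of the example data; the key 'Party' is a hypothetical key that is deliberately missing:

```python
d = {'Name': 'Angela Merkel', 'Country': 'Germany', 'Age': 65}

# get() returns a default value instead of raising a KeyError.
party = d.get('Party', 'n/a')

# Sorted iteration works on the keys, not on the dict object itself.
ordered = [(k, d[k]) for k in sorted(d)]

# pop() removes a key and returns the associated value.
age = d.pop('Age')

print(party)
print(ordered)
print(age, 'Age' in d)
```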


Table 3-3 provides a summary of selected operations and methods of the dict object. 


Table 3-3. Selected operations and methods of dict objects 


Method      Arguments   Returns/result
d[k]        [k]         Item of d with key k
d[k] = x    [k]         Sets item key k to x
del d[k]    [k]         Deletes item with key k
clear       ()          Removes all items
copy        ()          Makes a copy
items       ()          Iterator over all items
keys        ()          Iterator over all keys
values      ()          Iterator over all values
pop         (k)         Returns and removes item with key k
update      ([e])       Updates items with items from e


Sets


The final data structure this section covers is the set object. Although set theory is a 
cornerstone of mathematics and also of financial theory, there are not too many prac- 
tical applications for set objects. The objects are unordered collections of other 
objects, containing every element only once: 


In [131]: s = set(['u', 'd', 'ud', 'du', 'd', 'du'])
          s
Out[131]: {'d', 'du', 'u', 'ud'}


In [132]: t = set(['d', 'dd', 'uu', 'u'])


With set objects, one can implement basic operations on sets as in mathematical set
theory. For example, one can generate unions, intersections, and differences:


In [133]: s.union(t)  # (1)
Out[133]: {'d', 'dd', 'du', 'u', 'ud', 'uu'}


In [134]: s.intersection(t)  # (2)
Out[134]: {'d', 'u'}


In [135]: s.difference(t)  # (3)
Out[135]: {'du', 'ud'}


In [136]: t.difference(s)  # (4)
Out[136]: {'dd', 'uu'}


In [137]: s.symmetric_difference(t)  # (5)
Out[137]: {'dd', 'du', 'ud', 'uu'}


(1) All of s and t.

(2) Items in both s and t.

(3) Items in s but not in t.

(4) Items in t but not in s.

(5) Items in either s or t but not both.
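
The same operations are also available as operators, which some find more readable. A short sketch with the two example sets, written here as set literals:

```python
s = {'u', 'd', 'ud', 'du'}
t = {'d', 'dd', 'uu', 'u'}

union = s | t   # same as s.union(t)
both = s & t    # same as s.intersection(t)
only_s = s - t  # same as s.difference(t)
either = s ^ t  # same as s.symmetric_difference(t)

print(sorted(union), sorted(both), sorted(only_s), sorted(either))
```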


One application of set objects is to get rid of duplicates in a list object: 


In [138]: from random import randint
          l = [randint(0, 10) for i in range(1000)]  # (1)
          len(l)  # (2)
Out[138]: 1000


In [139]: l[:20]
Out[139]: [4, 2, 10, 2, 1, 10, 0, 6, 0, 8, 10, 9, 2, 4, 7, 8, 10, 8, 8, 2]


In [140]: s = set(l)
          s
Out[140]: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}


(1) 1,000 random integers between 0 and 10.

(2) Number of elements in l.
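
Converting to a set object does not keep the original order of the elements. If the order of first appearance matters, one option is dict.fromkeys(), which relies on dict objects preserving insertion order (Python 3.7 and later). A sketch with a short, fixed list instead of random numbers:

```python
l = [4, 2, 10, 2, 1, 10, 0, 6, 0, 8]  # short fixed list for illustration

# Duplicates removed via a set object, then sorted for a defined order.
unique_sorted = sorted(set(l))

# Duplicates removed while keeping the order of first appearance.
unique_in_order = list(dict.fromkeys(l))

print(unique_sorted)
print(unique_in_order)
```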


Conclusion


The basic Python interpreter provides a rich set of flexible data structures. From a 
finance point of view, the following can be considered the most important ones: 


Basic data types 
In Python in general and finance in particular, the classes int, float, bool, and 
str provide the atomic data types. 


Standard data structures 
The classes tuple, list, dict, and set have many application areas in finance,
with list being a flexible all-rounder for a number of use cases.


Further Resources 


With regard to data types and structures, this chapter focuses on those topics that 
might be of particular importance for financial algorithms and applications. How- 
ever, it can only represent a starting point for the exploration of data structures and 
data modeling in Python. 


There are a number of valuable resources available to go deeper from here. The offi- 
cial documentation for Python data structures is found at https://docs.python.org/3/ 
tutorial/datastructures.html. 


Good references in book form are:

• Goodrich, Michael, et al. (2013). Data Structures and Algorithms in Python.
  Hoboken, NJ: John Wiley & Sons.

• Harrison, Matt (2017). Illustrated Guide to Python 3. CreateSpace Treading on
  Python Series.

• Ramalho, Luciano (2016). Fluent Python. Sebastopol, CA: O'Reilly.

For an introduction to regular expressions, see:

• Fitzgerald, Michael (2012). Introducing Regular Expressions. Sebastopol, CA:
  O'Reilly.


CHAPTER 4
Numerical Computing with NumPy 


Computers are useless. They can only give answers. 


—Pablo Picasso 


Although the Python interpreter itself already brings a rich variety of data structures 
with it, NumPy and other libraries add to these in a valuable fashion. This chapter 
focuses on NumPy, which provides a multidimensional array object to store homoge- 
neous or heterogeneous data arrays and supports vectorization of code. 


The chapter covers the following data structures: 


Object type         Meaning                        Used for
ndarray (regular)   n-dimensional array object     Large arrays of numerical data
ndarray (record)    2-dimensional array object     Tabular data organized in columns


This chapter is organized as follows: 


“Arrays of Data”
    This section is about the handling of arrays of data with pure Python code.

“Regular NumPy Arrays”
    This is the core section about the regular NumPy ndarray class, the workhorse in
    almost all data-intensive Python use cases involving numerical data.

“Structured NumPy Arrays”
    This brief section introduces structured (or record) ndarray objects for the
    handling of tabular data with columns.

“Vectorization of Code”
    In this section, vectorization of code is discussed along with its benefits; the
    section also discusses the importance of memory layout in certain scenarios.


Arrays of Data 


The previous chapter showed that Python provides some quite useful and flexible 
general data structures. In particular, list objects can be considered a real workhorse
with many convenient characteristics and application areas. Using such a flexible 
(mutable) data structure has a cost, in the form of relatively high memory usage, 
slower performance, or both. However, scientific and financial applications generally 
have a need for high-performing operations on special data structures. One of the 
most important data structures in this regard is the array. Arrays generally structure 
other (fundamental) objects of the same data type in rows and columns. 


Assume for the moment that only numbers are relevant, although the concept gener- 
alizes to other types of data as well. In the simplest case, a one-dimensional array then 
represents, mathematically speaking, a vector of, in general, real numbers, internally 
represented by float objects. It then consists of a single row or column of elements 
only. In the more common case, an array represents an i × j matrix of elements. This
concept generalizes to i × j × k cubes of elements in three dimensions as well as to
general n-dimensional arrays of shape i × j × k × l × ....


Mathematical disciplines like linear algebra and vector space theory illustrate that 
such mathematical structures are of high importance in a number of scientific disci- 
plines and fields. It can therefore prove fruitful to have available a specialized class of 
data structures explicitly designed to handle arrays conveniently and efficiently. This 
is where the Python library NumPy comes into play, with its powerful ndarray class. 
Before introducing this class in the next section, this section illustrates two alterna- 
tives for the handling of arrays. 


Arrays with Python Lists 


Arrays can be constructed with the built-in data structures presented in the previous 
chapter. list objects are particularly suited to accomplishing this task. A simple list 
can already be considered a one-dimensional array: 


In [1]: v = [0.5, 0.75, 1.0, 1.5, 2.0]  # (1)


(1) list object with numbers.


Since list objects can contain arbitrary other objects, they can also contain other 
list objects. In that way, two- and higher-dimensional arrays are easily constructed 
by nested list objects: 
In [2]: m = [v, v, v]  # (1)
        m  # (2)
Out[2]: [[0.5, 0.75, 1.0, 1.5, 2.0],
         [0.5, 0.75, 1.0, 1.5, 2.0],
         [0.5, 0.75, 1.0, 1.5, 2.0]]


(1) list object with list objects ...

(2) ... resulting in a matrix of numbers.


One can also easily select rows via simple indexing or single elements via double 
indexing (whole columns, however, are not so easy to select): 


In [3]: m[1] 
Out[3]: [0.5, 0.75, 1.0, 1.5, 2.0]


In [4]: m[1][0] 
Out[4]: 0.5 


Nesting can be pushed further for even more general structures: 


In [5]: v1 = [0.5, 1.5]
        v2 = [1, 2]
        m = [v1, v2]
        c = [m, m]  # (1)
        c
Out[5]: [[[0.5, 1.5], [1, 2]], [[0.5, 1.5], [1, 2]]]


In [6]: c[1][1][0]
Out[6]: 1


(1) Cube of numbers.


Note that combining objects in the way just presented generally works with reference 
pointers to the original objects. What does that mean in practice? Have a look at the 
following operations: 


In [7]: v = [0.5, 0.75, 1.0, 1.5, 2.0]
        m = [v, v, v]
        m
Out[7]: [[0.5, 0.75, 1.0, 1.5, 2.0],
         [0.5, 0.75, 1.0, 1.5, 2.0],
         [0.5, 0.75, 1.0, 1.5, 2.0]]


Now change the value of the first element of the v object and see what happens to the 
m object: 


In [8]: v[0] = 'Python'
        m
Out[8]: [['Python', 0.75, 1.0, 1.5, 2.0],
         ['Python', 0.75, 1.0, 1.5, 2.0],
         ['Python', 0.75, 1.0, 1.5, 2.0]]


This can be avoided by using the deepcopy() function of the copy module:


In [9]: from copy import deepcopy
        v = [0.5, 0.75, 1.0, 1.5, 2.0]
        m = 3 * [deepcopy(v), ]  # (1)
        m
Out[9]: [[0.5, 0.75, 1.0, 1.5, 2.0],
         [0.5, 0.75, 1.0, 1.5, 2.0],
         [0.5, 0.75, 1.0, 1.5, 2.0]]


In [10]: v[0] = 'Python'  # (2)
         m  # (3)
Out[10]: [[0.5, 0.75, 1.0, 1.5, 2.0],
          [0.5, 0.75, 1.0, 1.5, 2.0],
          [0.5, 0.75, 1.0, 1.5, 2.0]]


(1) Instead of reference pointers, physical copies are used.

(2) As a consequence, a change in the original object ...

(3) ... does not have any impact anymore.
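

One caveat may be worth checking here: the expression 3 * [deepcopy(v), ] makes a single copy of v and repeats the reference to that copy three times. The rows of m are then independent of v, but not of each other. The following sketch contrasts this with one copy per row; the names m_shared and m_independent are illustrative and do not appear in the text.

```python
from copy import deepcopy

v = [0.5, 0.75, 1.0, 1.5, 2.0]

# One deep copy, referenced three times: the rows are decoupled from v,
# but all three rows are the same list object.
m_shared = 3 * [deepcopy(v)]

# One deep copy per row: every row is its own list object.
m_independent = [deepcopy(v) for _ in range(3)]

v[0] = 'Python'               # affects neither matrix
m_shared[0][1] = -1.0         # shows up in every row of m_shared
m_independent[0][1] = -1.0    # affects the first row only

print(m_shared)
print(m_independent)
```

If the rows are meant to be modified separately later on, the list comprehension is probably the safer choice.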


The Python array Class 


There is a dedicated array module available in Python. According to the documenta- 
tion: 


This module defines an object type which can compactly represent an array of basic 
values: characters, integers, floating point numbers. Arrays are sequence types and 
behave very much like lists, except that the type of objects stored in them is con- 
strained. The type is specified at object creation time by using a type code, which is a 
single character. 


Consider the following code, which instantiates an array object out of a list object:


In [11]: v = [0.5, 0.75, 1.0, 1.5, 2.0]

In [12]: import array

In [13]: a = array.array('f', v)  # (1)
         a
Out[13]: array('f', [0.5, 0.75, 1.0, 1.5, 2.0])

In [14]: a.append(0.5)  # (2)
         a
Out[14]: array('f', [0.5, 0.75, 1.0, 1.5, 2.0, 0.5])

In [15]: a.extend([5.0, 6.75])  # (2)
         a
Out[15]: array('f', [0.5, 0.75, 1.0, 1.5, 2.0, 0.5, 5.0, 6.75])

In [16]: 2 * a  # (3)
Out[16]: array('f', [0.5, 0.75, 1.0, 1.5, 2.0, 0.5, 5.0, 6.75, 0.5, 0.75, 1.0,
                     1.5, 2.0, 0.5, 5.0, 6.75])


(1) The instantiation of the array object with float as the type code.

(2) Major methods work similarly to those of the list object.

(3) Although “scalar multiplication” works in principle, the result is not the
    mathematically expected one; rather, the elements are repeated.


Trying to append an object of a different data type than the one specified raises a
TypeError:


In [17]: a.append('string')  # (1)

TypeError                                 Traceback (most recent call last)
<ipython-input-17-14cd6281866b> in <module>()
----> 1 a.append('string')  # (1)

TypeError: must be real number, not str

In [18]: a.tolist()  # (2)
Out[18]: [0.5, 0.75, 1.0, 1.5, 2.0, 0.5, 5.0, 6.75]


(1) Only float objects can be appended; other data types/type codes raise errors.

(2) However, the array object can easily be converted back to a list object if such
    flexibility is required.


An advantage of the array class is that it has built-in storage and retrieval
functionality:


In [19]: f = open('array.apy', 'wb')  # (1)
         a.tofile(f)  # (2)
         f.close()  # (3)

In [20]: with open('array.apy', 'wb') as f:  # (4)
             a.tofile(f)  # (4)

In [21]: !ls -n arr*  # (5)
         -rw-r--r--@ 1 503 20 32 Nov 7 11:46 array.apy


(1) Opens a file on disk for writing binary data.

(2) Writes the array data to the file.

(3) Closes the file.

(4) Alternative: uses a with context for the same operation.

(5) Shows the file as written on disk.


As before, the data type of the array object is of importance when reading the data 
from disk: 


In [22]: b = array.array('f')  # (1)

In [23]: with open('array.apy', 'rb') as f:  # (2)
             b.fromfile(f, 5)  # (3)

In [24]: b  # (3)
Out[24]: array('f', [0.5, 0.75, 1.0, 1.5, 2.0])

In [25]: b = array.array('d')  # (4)

In [26]: with open('array.apy', 'rb') as f:
             b.fromfile(f, 2)  # (5)

In [27]: b  # (6)
Out[27]: array('d', [0.0004882813645963324, 0.12500002956949174])


(1) Instantiates a new array object with type code float.

(2) Opens the file for reading binary data ...

(3) ... and reads five elements in the b object.

(4) Instantiates a new array object with type code double.

(5) Reads two elements from the file.

(6) The difference in type codes leads to “wrong” numbers.
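

The “wrong” numbers arise because the same bytes are reinterpreted under a different item size. The following is a minimal sketch of that effect without writing to disk, assuming the usual 4-byte 'f' and 8-byte 'd' items; the names raw, as_float, and as_double are illustrative.

```python
import array

a = array.array('f', [0.5, 0.75, 1.0, 1.5, 2.0, 0.5, 5.0, 6.75])
raw = a.tobytes()  # the same bytes that a.tofile() would write

# Reading the bytes back with the original type code restores the values.
as_float = array.array('f')
as_float.frombytes(raw)

# Reading them as doubles fuses neighboring floats into single, larger items.
as_double = array.array('d')
as_double.frombytes(raw[:2 * as_double.itemsize])

print(len(raw), a.itemsize, as_double.itemsize)
print(as_float.tolist()[:5])
print(as_double.tolist())
```

The file itself carries no type information, so the reader has to know which type code was used when the data was written.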


Regular NumPy Arrays 


Composing array structures with list objects works, somewhat. But it is not really
convenient, and the list class has not been built with this specific goal in mind. It 
has rather a much broader and more general scope. The array class is a bit more spe- 
cialized, providing some useful features for working with arrays of data. However, a 
truly specialized class could be really beneficial to handle array-type structures. 


The Basics 


numpy.ndarray is just such a class, built with the specific goal of handling n- 
dimensional arrays both conveniently and efficiently—i.e., in a highly performant 
manner. The basic handling of instances of this class is again best illustrated by
examples:


In [28]: import numpy as np  # (1)

In [29]: a = np.array([0, 0.5, 1.0, 1.5, 2.0])  # (2)
         a
Out[29]: array([0. , 0.5, 1. , 1.5, 2. ])

In [30]: type(a)  # (2)
Out[30]: numpy.ndarray

In [31]: a = np.array(['a', 'b', 'c'])  # (3)
         a
Out[31]: array(['a', 'b', 'c'], dtype='<U1')

In [32]: a = np.arange(2, 20, 2)  # (4)
         a
Out[32]: array([ 2,  4,  6,  8, 10, 12, 14, 16, 18])

In [33]: a = np.arange(8, dtype=float)  # (5)
         a
Out[33]: array([0., 1., 2., 3., 4., 5., 6., 7.])

In [34]: a[5:]  # (6)
Out[34]: array([5., 6., 7.])

In [35]: a[:2]  # (6)
Out[35]: array([0., 1.])


(1) Imports the numpy package.

(2) Creates an ndarray object out of a list object with floats.

(3) Creates an ndarray object out of a list object with strs.

(4) np.arange() works similarly to range() ...

(5) ... but takes as additional input the dtype parameter (the built-in float is
    used here because the former alias np.float is no longer available in current
    NumPy releases).

(6) With one-dimensional ndarray objects, indexing works as usual.


A major feature of the ndarray class is the multitude of built-in methods. For
instance:


In [36]: a.sum()  # (1)
Out[36]: 28.0

In [37]: a.std()  # (2)
Out[37]: 2.29128784747792

In [38]: a.cumsum()  # (3)
Out[38]: array([ 0.,  1.,  3.,  6., 10., 15., 21., 28.])


(1) The sum of all elements.

(2) The standard deviation of the elements.

(3) The cumulative sum of all elements (starting at index position 0).


Another major feature is the (vectorized) mathematical operations defined on 
ndarray objects: 
In [39]: l = [0., 0.5, 1.5, 3., 5.]
         2 * l  # (1)
Out[39]: [0.0, 0.5, 1.5, 3.0, 5.0, 0.0, 0.5, 1.5, 3.0, 5.0]

In [40]: a
Out[40]: array([0., 1., 2., 3., 4., 5., 6., 7.])

In [41]: 2 * a  # (2)
Out[41]: array([ 0.,  2.,  4.,  6.,  8., 10., 12., 14.])

In [42]: a ** 2  # (3)
Out[42]: array([ 0.,  1.,  4.,  9., 16., 25., 36., 49.])

In [43]: 2 ** a  # (4)
Out[43]: array([  1.,   2.,   4.,   8.,  16.,  32.,  64., 128.])

In [44]: a ** a  # (5)
Out[44]: array([1.00000e+00, 1.00000e+00, 4.00000e+00, 2.70000e+01, 2.56000e+02,
                3.12500e+03, 4.66560e+04, 8.23543e+05])


(1) Scalar multiplication with list objects leads to a repetition of elements.

(2) By contrast, working with ndarray objects implements a proper scalar
    multiplication.

(3) This calculates element-wise the square values.

(4) This interprets the elements of the ndarray as the powers.

(5) This calculates the power of every element to itself.
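

To make the contrast concrete, the following sketch reproduces the scalar multiplication once with a list comprehension and once with an ndarray. The names scaled_list and scaled_array are illustrative.

```python
import numpy as np

l = [0., 0.5, 1.5, 3., 5.]
a = np.array(l)

# With a list, "2 * l" repeats the elements; element-wise scaling
# requires an explicit loop (here a list comprehension).
scaled_list = [2 * x for x in l]

# With an ndarray, the same arithmetic is applied to every element at once.
scaled_array = 2 * a

print(len(2 * l), scaled_list)
print(scaled_array)
```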


Universal functions are another important feature of the NumPy package. They are 
“universal” in the sense that they in general operate on ndarray objects as well as on 
basic Python data types. However, when applying universal functions to, say, a 
Python float object, one needs to be aware of the reduced performance compared to 
the same functionality found in the math module: 
In [45]: np.exp(a)  # (1)
Out[45]: array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
                5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03])

In [46]: np.sqrt(a)  # (2)
Out[46]: array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
                2.23606798, 2.44948974, 2.64575131])

In [47]: np.sqrt(2.5)  # (3)
Out[47]: 1.5811388300841898

In [48]: import math  # (4)

In [49]: math.sqrt(2.5)  # (4)
Out[49]: 1.5811388300841898

In [50]: math.sqrt(a)  # (5)

TypeError                                 Traceback (most recent call last)
<ipython-input-50-b39de4150838> in <module>()
----> 1 math.sqrt(a)  # (5)

TypeError: only size-1 arrays can be converted to Python scalars

In [51]: %timeit np.sqrt(2.5)  # (6)
         722 ns ± 13.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops
         each)

In [52]: %timeit math.sqrt(2.5)  # (7)
         91.8 ns ± 4.13 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops
         each)


(1) Calculates the exponential values element-wise.

(2) Calculates the square root for every element.

(3) Calculates the square root for a Python float object.

(4) The same calculation, this time using the math module.

(5) The math.sqrt() function cannot be applied to the ndarray object directly.

(6) Applying the universal function np.sqrt() to a Python float object ...

(7) ... is much slower than the same operation with the math.sqrt() function.
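

Outside of IPython, a comparable measurement can be sketched with the standard timeit module. The repetition count n below is an arbitrary choice, and the absolute timings will differ from machine to machine.

```python
import math
import timeit

import numpy as np

x = 2.5
n = 100_000  # number of repetitions per measurement (arbitrary choice)

t_np = timeit.timeit(lambda: np.sqrt(x), number=n)
t_math = timeit.timeit(lambda: math.sqrt(x), number=n)

print(f'np.sqrt:   {t_np / n * 1e9:8.1f} ns per call')
print(f'math.sqrt: {t_math / n * 1e9:8.1f} ns per call')

# On an array, np.sqrt() handles all elements in one call, whereas
# math.sqrt() needs an explicit loop over the elements.
a = np.arange(8, dtype=float)
roots_np = np.sqrt(a)
roots_math = [math.sqrt(v) for v in a]
```

Both functions return the same values; the difference lies in the per-call overhead for scalar arguments.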
Multiple Dimensions


The transition to more than one dimension is seamless, and all features presented so
far carry over to the more general cases. In particular, the indexing system is made
consistent across all dimensions:


In [53]: b = np.array([a, a * 2])  # (1)
         b
Out[53]: array([[ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.],
                [ 0.,  2.,  4.,  6.,  8., 10., 12., 14.]])

In [54]: b[0]  # (2)
Out[54]: array([0., 1., 2., 3., 4., 5., 6., 7.])

In [55]: b[0, 2]  # (3)
Out[55]: 2.0

In [56]: b[:, 1]  # (4)
Out[56]: array([1., 2.])

In [57]: b.sum()  # (5)
Out[57]: 84.0

In [58]: b.sum(axis=0)  # (6)
Out[58]: array([ 0.,  3.,  6.,  9., 12., 15., 18., 21.])

In [59]: b.sum(axis=1)  # (7)
Out[59]: array([28., 56.])


(1) Constructs a two-dimensional ndarray object out of the one-dimensional one.

(2) Selects the first row.

(3) Selects the third element in the first row; indices are separated, within the
    brackets, by a comma.

(4) Selects the second column.

(5) Calculates the sum of all values.

(6) Calculates the sum along the first axis; i.e., column-wise.

(7) Calculates the sum along the second axis; i.e., row-wise.
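

The axis argument names the dimension that is collapsed by the operation, which is perhaps easiest to see from the shapes of the results. A small sketch follows; the names col_sums, row_sums, and row_sums_2d are illustrative.

```python
import numpy as np

a = np.arange(8, dtype=float)
b = np.array([a, a * 2])            # shape (2, 8)

col_sums = b.sum(axis=0)            # collapses axis 0 (rows): one sum per column
row_sums = b.sum(axis=1)            # collapses axis 1 (columns): one sum per row

# keepdims=True retains the collapsed axis with length 1.
row_sums_2d = b.sum(axis=1, keepdims=True)

print(b.shape, col_sums.shape, row_sums.shape, row_sums_2d.shape)
```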


There are a number of ways to initialize (instantiate) ndarray objects. One is as pre- 
sented before, via np.array. However, this assumes that all elements of the array are 
already available. In contrast, one might like to have the ndarray objects instantiated 
first to populate them later with results generated during the execution of code. To
this end, one can use the following functions:


In [60]: c = np.zeros((2, 3), dtype='i', order='C')  # (1)
         c
Out[60]: array([[0, 0, 0],
                [0, 0, 0]], dtype=int32)

In [61]: c = np.ones((2, 3, 4), dtype='i', order='C')  # (2)
         c
Out[61]: array([[[1, 1, 1, 1],
                 [1, 1, 1, 1],
                 [1, 1, 1, 1]],

                [[1, 1, 1, 1],
                 [1, 1, 1, 1],
                 [1, 1, 1, 1]]], dtype=int32)

In [62]: d = np.zeros_like(c, dtype='f16', order='C')  # (3)
         d
Out[62]: array([[[0., 0., 0., 0.],
                 [0., 0., 0., 0.],
                 [0., 0., 0., 0.]],

                [[0., 0., 0., 0.],
                 [0., 0., 0., 0.],
                 [0., 0., 0., 0.]]], dtype=float128)

In [63]: d = np.ones_like(c, dtype='f16', order='C')  # (3)
         d
Out[63]: array([[[1., 1., 1., 1.],
                 [1., 1., 1., 1.],
                 [1., 1., 1., 1.]],

                [[1., 1., 1., 1.],
                 [1., 1., 1., 1.],
                 [1., 1., 1., 1.]]], dtype=float128)

In [64]: e = np.empty((2, 3, 2))  # (4)
         e
Out[64]: array([[[0.00000000e+000, 0.00000000e+000],
                 [0.00000000e+000, 0.00000000e+000],
                 [0.00000000e+000, 0.00000000e+000]],

                [[0.00000000e+000, 0.00000000e+000],
                 [0.00000000e+000, 7.49874326e+247],
                 [1.28822975e-231, 4.33190018e-311]]])

In [65]: f = np.empty_like(c)  # (4)
         f
Out[65]: array([[[         0,          0,          0,          0],
                 [         0,          0,          0,          0],
                 [         0,          0,          0,          0]],

                [[         0,          0,          0,          0],
                 [         0,          0,  740455269, 1936028450],
                 [         0,  268435456, 1835316017,       2041]]], dtype=int32)

In [66]: np.eye(5)  # (5)
Out[66]: array([[1., 0., 0., 0., 0.],
                [0., 1., 0., 0., 0.],
                [0., 0., 1., 0., 0.],
                [0., 0., 0., 1., 0.],
                [0., 0., 0., 0., 1.]])

In [67]: g = np.linspace(5, 15, 12)  # (6)
         g
Out[67]: array([ 5.        ,  5.90909091,  6.81818182,  7.72727273,  8.63636364,
                 9.54545455, 10.45454545, 11.36363636, 12.27272727, 13.18181818,
                14.09090909, 15.        ])


(1) Creates an ndarray object prepopulated with zeros.

(2) Creates an ndarray object prepopulated with ones.

(3) The same, but takes another ndarray object to infer the shape.

(4) Creates an ndarray object not prepopulated with anything (numbers depend on
    the bits present in the memory).

(5) Creates a square matrix as an ndarray object with the diagonal populated by
    ones.

(6) Creates a one-dimensional ndarray object with evenly spaced intervals between
    numbers; parameters used are start, end, and num (number of elements).


For these functions (np.eye() and np.linspace() take somewhat different arguments),
one can provide the following parameters:


shape
    Either an int, a sequence of int objects, or a reference to another ndarray

dtype (optional)
    A dtype; these are NumPy-specific data types for ndarray objects

order (optional)
    The order in which to store elements in memory: C for C-like (i.e., row-wise) or F
    for Fortran-like (i.e., column-wise)
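

The effect of the order parameter is not visible when printing an array, but it can be inspected through the flags and strides attributes. The following sketch assumes 8-byte floats (dtype 'f8'); the names c_arr and f_arr are illustrative.

```python
import numpy as np

c_arr = np.zeros((2, 3), dtype='f8', order='C')
f_arr = np.zeros((2, 3), dtype='f8', order='F')

# The flags attribute reports which layout the data buffer uses.
print(c_arr.flags['C_CONTIGUOUS'], c_arr.flags['F_CONTIGUOUS'])
print(f_arr.flags['C_CONTIGUOUS'], f_arr.flags['F_CONTIGUOUS'])

# strides: number of bytes to step in memory to reach the next element
# along each axis.
print(c_arr.strides)  # row-wise: neighbors within a row are adjacent
print(f_arr.strides)  # column-wise: neighbors within a column are adjacent
```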


Here, it becomes obvious how NumPy specializes the construction of arrays with the 
ndarray class, in comparison to the list-based approach:


• The ndarray object has built-in dimensions (axes).

• The ndarray object has a fixed length (size); its elements can still be changed in
  place.

• It only allows for a single data type (np.dtype) for the whole array.


The array class by contrast shares only the characteristic of allowing for a single data 
type (type code, dtype). 
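

A short sketch of what the fixed size and the single data type mean in practice: element values can be overwritten in place, values of another type are cast to the array's dtype, and changing the number of elements produces a new object. The variable names are illustrative.

```python
import numpy as np

a = np.arange(5)            # integer dtype

a[0] = 10                   # element values can be overwritten in place
a[1] = 3.7                  # the float is cast to the array's integer dtype

b = np.append(a, 99)        # "growing" returns a new, larger ndarray

print(a, a.dtype)
print(b.size, a.size)
```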

The role of the order parameter is discussed later in the chapter. Table 4-1 provides 
an overview of selected np.dtype objects (i.e., the basic data types NumPy allows). 


Table 4-1. NumPy dtype objects

dtype  Description             Example
?      Boolean                 ? (True or False)
i      Signed integer          i8 (64-bit)
u      Unsigned integer        u8 (64-bit)
f      Floating point          f8 (64-bit)
c      Complex floating point  c32 (256-bit)
m      timedelta               m (64-bit)
M      datetime                M (64-bit)
O      Object                  O (pointer to object)
U      Unicode                 U24 (24 Unicode characters)
V      Raw data (void)         V12 (12-byte data block)


Metainformation


Every ndarray object provides access to a number of useful attributes: 


In [68]: g.size  # (1)
Out[68]: 12

In [69]: g.itemsize  # (2)
Out[69]: 8

In [70]: g.ndim  # (3)
Out[70]: 1

In [71]: g.shape  # (4)
Out[71]: (12,)

In [72]: g.dtype  # (5)
Out[72]: dtype('float64')

In [73]: g.nbytes  # (6)
Out[73]: 96


(1) The number of elements.

(2) The number of bytes used to represent one element.

(3) The number of dimensions.

(4) The shape of the ndarray object.

(5) The dtype of the elements.

(6) The total number of bytes used in memory.


Reshaping and Resizing 


Although the size of an ndarray object is fixed once it is created, there are multiple
options to reshape and resize such an object. While reshaping in general just provides
another view on the same data, resizing in general creates a new (temporary) object.
First, some examples of reshaping:


In [74]: g = np.arange(15)

In [75]: g
Out[75]: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [76]: g.shape  # (1)
Out[76]: (15,)

In [77]: np.shape(g)  # (1)
Out[77]: (15,)

In [78]: g.reshape((3, 5))  # (2)
Out[78]: array([[ 0,  1,  2,  3,  4],
                [ 5,  6,  7,  8,  9],
                [10, 11, 12, 13, 14]])

In [79]: h = g.reshape((5, 3))  # (3)
         h
Out[79]: array([[ 0,  1,  2],
                [ 3,  4,  5],
                [ 6,  7,  8],
                [ 9, 10, 11],
                [12, 13, 14]])

In [80]: h.T  # (4)
Out[80]: array([[ 0,  3,  6,  9, 12],
                [ 1,  4,  7, 10, 13],
                [ 2,  5,  8, 11, 14]])

In [81]: h.transpose()  # (4)
Out[81]: array([[ 0,  3,  6,  9, 12],
                [ 1,  4,  7, 10, 13],
                [ 2,  5,  8, 11, 14]])


(1) The shape of the original ndarray object.

(2) Reshaping to two dimensions (memory view).

(3) Creating a new object.

(4) The transpose of the new ndarray object.


During a reshaping operation, the total number of elements in the ndarray object is 
unchanged. During a resizing operation, this number changes—it either decreases 
(“down-sizing”) or increases (“up-sizing”). Here are some examples of resizing:


In [82]: g 
Out[82]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]) 


In [83]: np.resize(g, (3, 1))  # (1)
Out[83]: array([[0],
                [1],
                [2]])

In [84]: np.resize(g, (1, 5))  # (1)
Out[84]: array([[0, 1, 2, 3, 4]])

In [85]: np.resize(g, (2, 5))  # (1)
Out[85]: array([[0, 1, 2, 3, 4],
                [5, 6, 7, 8, 9]])

In [86]: n = np.resize(g, (5, 4))  # (2)
         n
Out[86]: array([[ 0,  1,  2,  3],
                [ 4,  5,  6,  7],
                [ 8,  9, 10, 11],
                [12, 13, 14,  0],
                [ 1,  2,  3,  4]])


(1) Two dimensions, down-sizing.

(2) Two dimensions, up-sizing.
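

The difference between a view (reshaping) and a new object (resizing) can be checked directly. The following is a minimal sketch using np.shares_memory(); it reuses the names g, h, and n from above.

```python
import numpy as np

g = np.arange(15)

h = g.reshape((5, 3))        # same data buffer, different shape
n = np.resize(g, (5, 4))     # new object; g is repeated to fill 20 slots

print(np.shares_memory(g, h))   # the reshape result shares memory with g
print(np.shares_memory(g, n))   # the resize result does not

h[0, 0] = 100                # writing through the view changes g as well
print(g[0], n[0, 0])
```

A consequence is that in-place changes to a reshaped array also show up in the original, which may or may not be intended.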


Stacking is a special operation that allows the horizontal or vertical combination of 
two ndarray objects. However, the size of the “connecting” dimension must be the 
same: 
In [87]: h
Out[87]: array([[ 0,  1,  2],
                [ 3,  4,  5],
                [ 6,  7,  8],
                [ 9, 10, 11],
                [12, 13, 14]])

In [88]: np.hstack((h, 2 * h))  # (1)
Out[88]: array([[ 0,  1,  2,  0,  2,  4],
                [ 3,  4,  5,  6,  8, 10],
                [ 6,  7,  8, 12, 14, 16],
                [ 9, 10, 11, 18, 20, 22],
                [12, 13, 14, 24, 26, 28]])

In [89]: np.vstack((h, 0.5 * h))  # (2)
Out[89]: array([[ 0. ,  1. ,  2. ],
                [ 3. ,  4. ,  5. ],
                [ 6. ,  7. ,  8. ],
                [ 9. , 10. , 11. ],
                [12. , 13. , 14. ],
                [ 0. ,  0.5,  1. ],
                [ 1.5,  2. ,  2.5],
                [ 3. ,  3.5,  4. ],
                [ 4.5,  5. ,  5.5],
                [ 6. ,  6.5,  7. ]])


(1) Horizontal stacking of two ndarray objects.

(2) Vertical stacking of two ndarray objects.
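

The requirement on the “connecting” dimension can be illustrated with a second array of a different shape. In the sketch below, k is a hypothetical array with four rows that does not appear in the text.

```python
import numpy as np

h = np.arange(15).reshape((5, 3))
k = np.arange(12).reshape((4, 3))   # hypothetical second array with 4 rows

# Vertical stacking only needs matching column counts: (5, 3) + (4, 3) -> (9, 3).
v = np.vstack((h, k))

# Horizontal stacking needs matching row counts; 5 vs. 4 rows does not fit.
try:
    np.hstack((h, k))
    hstack_failed = False
except ValueError as err:
    hstack_failed = True
    print('hstack failed:', err)

print(v.shape)
```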


Another special operation is the flattening of a multidimensional ndarray object to a 
one-dimensional one. One can choose whether the flattening happens row-by-row (C 
order) or column-by-column (F order): 


In [90]: h
Out[90]: array([[ 0,  1,  2],
                [ 3,  4,  5],
                [ 6,  7,  8],
                [ 9, 10, 11],
                [12, 13, 14]])

In [91]: h.flatten()  # (1)
Out[91]: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [92]: h.flatten(order='C')  # (1)
Out[92]: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [93]: h.flatten(order='F')  # (2)
Out[93]: array([ 0,  3,  6,  9, 12,  1,  4,  7, 10, 13,  2,  5,  8, 11, 14])

In [94]: for i in h.flat:  # (3)
             print(i, end=',')
         0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,

In [95]: for i in h.ravel(order='C'):  # (4)
             print(i, end=',')
         0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,

In [96]: for i in h.ravel(order='F'):  # (4)
             print(i, end=',')
         0,3,6,9,12,1,4,7,10,13,2,5,8,11,14,


(1) The default order for flattening is C.

(2) Flattening with F order.

(3) The flat attribute provides a flat iterator (C order).

(4) The ravel() method is an alternative to flatten().


Boolean Arrays


Comparison and logical operations in general work on ndarray objects the same way,
element-wise, as on standard Python data types. Evaluating conditions yields by
default a Boolean ndarray object (dtype is bool):


In [97]: h
Out[97]: array([[ 0,  1,  2],
                [ 3,  4,  5],
                [ 6,  7,  8],
                [ 9, 10, 11],
                [12, 13, 14]])

In [98]: h > 8  # (1)
Out[98]: array([[False, False, False],
                [False, False, False],
                [False, False, False],
                [ True,  True,  True],
                [ True,  True,  True]])

In [99]: h <= 7  # (2)
Out[99]: array([[ True,  True,  True],
                [ True,  True,  True],
                [ True,  True, False],
                [False, False, False],
                [False, False, False]])

In [100]: h == 5  # (3)
Out[100]: array([[False, False, False],
                 [False, False,  True],
                 [False, False, False],
                 [False, False, False],
                 [False, False, False]])

In [101]: (h == 5).astype(int)  # (4)
Out[101]: array([[0, 0, 0],
                 [0, 0, 1],
                 [0, 0, 0],
                 [0, 0, 0],
                 [0, 0, 0]])

In [102]: (h > 4) & (h <= 12)  # (5)
Out[102]: array([[False, False, False],
                 [False, False,  True],
                 [ True,  True,  True],
                 [ True,  True,  True],
                 [ True, False, False]])


(1) Is value greater than ...?

(2) Is value smaller than or equal to ...?

(3) Is value equal to ...?

(4) Present True and False as integer values 1 and 0.

(5) Is value greater than ... and smaller than or equal to ...?


Such Boolean arrays can be used for indexing and data selection. Notice that the fol- 
lowing operations flatten the data: 


In [103]: h[h > 8]  # (1)
Out[103]: array([ 9, 10, 11, 12, 13, 14])

In [104]: h[(h > 4) & (h <= 12)]  # (2)
Out[104]: array([ 5,  6,  7,  8,  9, 10, 11, 12])

In [105]: h[(h < 4) | (h >= 12)]  # (3)
Out[105]: array([ 0,  1,  2,  3, 12, 13, 14])


(1) Give me all values greater than ...

(2) Give me all values greater than ... and smaller than or equal to ...

(3) Give me all values smaller than ... or greater than or equal to ...
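

Beyond selection, a Boolean array can also be used for counting and for assignment. The following sketch shows both; the names mask, count, selected, and capped are illustrative.

```python
import numpy as np

h = np.arange(15).reshape((5, 3))
mask = h > 8                      # Boolean ndarray with the shape of h

count = mask.sum()                # True counts as 1, so this counts the hits
selected = h[mask]                # one-dimensional result, row-major order

capped = h.copy()                 # work on a copy to leave h untouched
capped[mask] = 8                  # assignment through the mask keeps the shape

print(count, selected)
print(capped)
```

Selection via the mask flattens the data, whereas assignment via the mask leaves the shape of the target array unchanged.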


A powerful tool in this regard is the np.where() function, which allows the definition 
of actions/operations depending on whether a condition is True or False. The result 
of applying np.where() is a new ndarray object of the same shape as the original one: 


In [106]: np.where(h > 7, 1, 0)  # (1)
Out[106]: array([[0, 0, 0],
                 [0, 0, 0],
                 [0, 0, 1],
                 [1, 1, 1],
                 [1, 1, 1]])

In [107]: np.where(h % 2 == 0, 'even', 'odd')  # (2)
Out[107]: array([['even', 'odd', 'even'],
                 ['odd', 'even', 'odd'],
                 ['even', 'odd', 'even'],
                 ['odd', 'even', 'odd'],
                 ['even', 'odd', 'even']], dtype='<U4')

In [108]: np.where(h <= 7, h * 2, h / 2)  # (3)
Out[108]: array([[ 0. ,  2. ,  4. ],
                 [ 6. ,  8. , 10. ],
                 [12. , 14. ,  4. ],
                 [ 4.5,  5. ,  5.5],
                 [ 6. ,  6.5,  7. ]])


(1) In the new object, set 1 if True and 0 otherwise.

(2) In the new object, set even if True and odd otherwise.

(3) In the new object, set two times the h element if True and half the h element
    otherwise.
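

When called with the condition only, np.where() returns the positions at which the condition holds, as one index array per dimension. A brief sketch of this second usage, with illustrative names rows and cols:

```python
import numpy as np

h = np.arange(15).reshape((5, 3))

rows, cols = np.where(h > 11)     # one index array per dimension

print(rows, cols)
print(h[rows, cols])              # the matching values, via index arrays
```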


Later chapters provide more examples of these important operations on ndarray 
objects. 


Speed Comparison 


We'll move on to structured arrays with NumPy shortly, but let us stick with regular 
arrays for a moment and see what the specialization brings in terms of performance. 


As a simple example, consider the generation of a matrix/array of shape 5,000 x 5,000 
elements, populated with pseudo-random, standard normally distributed numbers. 
The sum of all elements shall then be calculated. First, the pure Python approach, 
where list comprehensions are used: 


In [109]: import random
          I = 5000

In [110]: %time mat = [[random.gauss(0, 1) for j in range(I)] \
                       for i in range(I)]  # (1)
          CPU times: user 17.1 s, sys: 361 ms, total: 17.4 s
          Wall time: 17.4 s

In [111]: mat[0][:5]  # (2)
Out[111]: [-0.40594967782329183,
           -1.357757478015285,
           0.05129566894355976,
           -0.8958429976582192,
           0.6234174778878331]

In [112]: %time sum([sum(l) for l in mat])  # (3)
          CPU times: user 142 ms, sys: 1.69 ms, total: 144 ms
          Wall time: 143 ms
Out[112]: -3561.944965714259

In [113]: import sys
          sum([sys.getsizeof(l) for l in mat])  # (4)
Out[113]: 215200000


(1) The creation of the matrix via a nested list comprehension.

(2) Some selected random numbers from those drawn.

(3) The sums of the single list objects are first calculated during a list
    comprehension; then the sum of the sums is taken.

(4) This adds up the memory usage of all list objects (the float objects that the
    lists point to are not included).


Let us now turn to NumPy and see how the same problem is solved there. For conve- 
nience, the NumPy subpackage random offers a multitude of functions to instantiate an 
ndarray object and populate it at the same time with pseudo-random numbers: 

In [114]: %time mat = np.random.standard_normal((I, I))  # (1)
          CPU times: user 1.01 s, sys: 200 ms, total: 1.21 s
          Wall time: 1.21 s

In [115]: %time mat.sum()  # (2)
          CPU times: user 29.7 ms, sys: 1.15 ms, total: 30.8 ms
          Wall time: 29.4 ms
Out[115]: -186.12767026606448

In [116]: mat.nbytes  # (3)
Out[116]: 200000000

In [117]: sys.getsizeof(mat)  # (3)
Out[117]: 200000112


(1) Creates the ndarray object with standard normally distributed random numbers;
    it is faster by a factor of about 14.

(2) Calculates the sum of all values in the ndarray object; it is faster by a factor of
    about 4.5.

(3) The NumPy approach also saves some memory since the memory overhead of the
    ndarray object is tiny compared to the size of the data itself.
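

Outside of IPython, the comparison can be sketched with the standard time module. The example below uses a much smaller matrix (I = 500) to keep the run short, an arbitrarily chosen seed, and the same numbers for both approaches so that the two sums can be compared; the timings will vary by machine.

```python
import time

import numpy as np

I = 500  # much smaller than in the text to keep the run short
rng = np.random.default_rng(100)          # seed chosen arbitrarily
arr = rng.standard_normal((I, I))
nested = arr.tolist()                     # same numbers as nested list objects

t0 = time.perf_counter()
py_sum = sum([sum(row) for row in nested])
t1 = time.perf_counter()
np_sum = arr.sum()
t2 = time.perf_counter()

print(f'pure Python: {t1 - t0:.6f} s, NumPy: {t2 - t1:.6f} s')
print(f'data buffer of the ndarray: {arr.nbytes} bytes')
```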
Using NumPy Arrays


The use of NumPy for array-based operations and algorithms generally
results in compact, easily readable code and significant performance
improvements over pure Python code.


Structured NumPy Arrays 


The specialization of the ndarray class obviously brings a number of valuable
benefits with it. However, an overly narrow specialization might turn out to be too
large a burden to carry for the majority of array-based algorithms and applications.
Therefore, NumPy provides structured ndarray and record recarray objects that allow
you to have a different dtype per column. What does “per column” mean? Consider the
following initialization of a structured ndarray object:


In [118]: dt = np.dtype([('Name', 'S10'), ('Age', 'i4'),
                         ('Height', 'f'), ('Children/Pets', 'i4', 2)])  # (1)

In [119]: dt  # (1)
Out[119]: dtype([('Name', 'S10'), ('Age', '<i4'), ('Height', '<f4'),
                 ('Children/Pets', '<i4', (2,))])

In [120]: dt = np.dtype({'names': ['Name', 'Age', 'Height', 'Children/Pets'],
                       'formats':'O int float int,int'.split()})  # (2)

In [121]: dt  # (2)
Out[121]: dtype([('Name', 'O'), ('Age', '<i8'), ('Height', '<f8'),
                 ('Children/Pets', [('f0', '<i8'), ('f1', '<i8')])])

In [122]: s = np.array([('Smith', 45, 1.83, (0, 1)),
                        ('Jones', 53, 1.72, (2, 2))], dtype=dt)  # (3)

In [123]: s  # (3)
Out[123]: array([('Smith', 45, 1.83, (0, 1)), ('Jones', 53, 1.72, (2, 2))],
                dtype=[('Name', 'O'), ('Age', '<i8'), ('Height', '<f8'),
                ('Children/Pets', [('f0', '<i8'), ('f1', '<i8')])])

In [124]: type(s)  # (4)
Out[124]: numpy.ndarray


(1) The complex dtype is composed.

(2) An alternative syntax to achieve a similar result (here with an object column
    for the names and 64-bit numeric types).

(3) The structured ndarray is instantiated with two records.

(4) The object type is still ndarray.


In a sense, this construction comes quite close to the operation for initializing tables 
in a SQL database: one has column names and column data types, with maybe some 
additional information (e.g., maximum number of characters per str object). The 
single columns can now be easily accessed by their names and the rows by their index 
values: 


In [125]: s['Name']  # (1)
Out[125]: array(['Smith', 'Jones'], dtype=object)

In [126]: s['Height'].mean()  # (2)
Out[126]: 1.775

In [127]: s[0]  # (3)
Out[127]: ('Smith', 45, 1.83, (0, 1))

In [128]: s[1]['Age']  # (4)
Out[128]: 53


(1) Selecting a column by name.

(2) Calling a method on a selected column.

(3) Selecting a record.

(4) Selecting a field in a record.


In summary, structured arrays are a generalization of the regular ndarray object type 
in that the data type only has to be the same per column, like in tables in SQL data- 
bases. One advantage of structured arrays is that a single element of a column can be 
another multidimensional object and does not have to conform to the basic NumPy 
data types. 
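

The SQL analogy can be pushed a bit further with selection and sorting by column. The sketch below uses a simplified dtype; the fixed-width 'U10' name field and the omission of the nested column are assumptions made here to keep the example short.

```python
import numpy as np

# Field names follow the text; the fixed-width 'U10' name field is an
# assumption made here to keep the example simple.
dt = np.dtype([('Name', 'U10'), ('Age', 'i4'), ('Height', 'f8')])
s = np.array([('Smith', 45, 1.83), ('Jones', 53, 1.72)], dtype=dt)

older = s[s['Age'] > 50]                 # Boolean selection on one column
by_height = np.sort(s, order='Height')   # sort records by a named field

print(older['Name'])
print(by_height['Name'])
```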


Structured Arrays 


NumPy provides, in addition to regular arrays, structured (and 
record) arrays that allow the description and handling of table-like 
data structures with a variety of different data types per (named) 
column. They bring SQL table-like data structures to Python, with 
most of the benefits of regular ndarray objects (syntax, methods, 
performance). 


Vectorization of Code 


Vectorization is a strategy to get more compact code that is possibly executed faster. 
The fundamental idea is to conduct an operation on or to apply a function to a com- 
plex object “at once” and not by looping over the single elements of the object. In 
Python, functional programming tools such as map() and filter() provide some
basic means for vectorization. However, NumPy has vectorization built in deep down 
in its core. 


Basic Vectorization 


As demonstrated in the previous section, simple mathematical operations—such as 
calculating the sum of all elements—can be implemented on ndarray objects directly 
(via methods or universal functions). More general vectorized operations are also 
possible. For example, one can add two NumPy arrays element-wise as follows: 

In [129]: np.random.seed(100)
          r = np.arange(12).reshape((4, 3))  # (1)
          s = np.arange(12).reshape((4, 3)) * 0.5  # (2)

In [130]: r  # (1)
Out[130]: array([[ 0,  1,  2],
                 [ 3,  4,  5],
                 [ 6,  7,  8],
                 [ 9, 10, 11]])

In [131]: s  # (2)
Out[131]: array([[0. , 0.5, 1. ],
                 [1.5, 2. , 2.5],
                 [3. , 3.5, 4. ],
                 [4.5, 5. , 5.5]])

In [132]: r + s  # (3)
Out[132]: array([[ 0. ,  1.5,  3. ],
                 [ 4.5,  6. ,  7.5],
                 [ 9. , 10.5, 12. ],
                 [13.5, 15. , 16.5]])


(1) The first ndarray object, filled with the integers 0 to 11.

(2) The second ndarray object, holding half of each of those values.

(3) Element-wise addition as a vectorized operation (no looping).


NumPy also supports what is called broadcasting. This allows you to combine objects 
of different shape within a single operation. Previous examples have already made 
use of this. Consider the following examples: 


In [133]: r + 3  (1)
Out[133]: array([[ 3,  4,  5],
                 [ 6,  7,  8],
                 [ 9, 10, 11],
                 [12, 13, 14]])

In [134]: 2 * r  (2)
Out[134]: array([[ 0,  2,  4],
                 [ 6,  8, 10],
                 [12, 14, 16],
                 [18, 20, 22]])

In [135]: 2 * r + 3  (3)
Out[135]: array([[ 3,  5,  7],
                 [ 9, 11, 13],
                 [15, 17, 19],
                 [21, 23, 25]])

(1) During scalar addition, the scalar is broadcast and added to every element.
(2) During scalar multiplication, the scalar is also broadcast to and multiplied
    with every element.
(3) This linear transformation combines both operations.


These operations work with differently shaped ndarray objects as well, up to a certain 
point:

In [136]: r
Out[136]: array([[ 0,  1,  2],
                 [ 3,  4,  5],
                 [ 6,  7,  8],
                 [ 9, 10, 11]])

In [137]: r.shape
Out[137]: (4, 3)

In [138]: s = np.arange(0, 12, 4)  (1)
          s
Out[138]: array([0, 4, 8])

In [139]: r + s  (2)
Out[139]: array([[ 0,  5, 10],
                 [ 3,  8, 13],
                 [ 6, 11, 16],
                 [ 9, 14, 19]])

In [140]: s = np.arange(0, 12, 3)  (3)
          s
Out[140]: array([0, 3, 6, 9])

In [141]: r + s  (4)

ValueError                        Traceback (most recent call last)
<ipython-input-141-1890b26ec965> in <module>()
----> 1 r + s

ValueError: operands could not be broadcast together
with shapes (4,3) (4,)

In [142]: r.transpose() + s  (5)
Out[142]: array([[ 0,  6, 12, 18],
                 [ 1,  7, 13, 19],
                 [ 2,  8, 14, 20]])

In [143]: sr = s.reshape(-1, 1)  (6)
          sr
Out[143]: array([[0],
                 [3],
                 [6],
                 [9]])

In [144]: sr.shape  (6)
Out[144]: (4, 1)

In [145]: r + s.reshape(-1, 1)  (6)
Out[145]: array([[ 0,  1,  2],
                 [ 6,  7,  8],
                 [12, 13, 14],
                 [18, 19, 20]])

(1) A new one-dimensional ndarray object of length 3.
(2) The r (matrix) and s (vector) objects can be added straightforwardly.
(3) Another one-dimensional ndarray object of length 4.
(4) The length of the new s (vector) object is now different from the length of the
    second dimension of the r object.
(5) Transposing the r object again allows for the vectorized addition.
(6) Alternatively, the shape of s can be changed to (4, 1) to make the addition work
    (the results are different, however).
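The behavior above follows from NumPy's broadcasting rule: shapes are compared starting from the trailing dimension, and two dimensions are compatible if they are equal or if one of them is 1. The following is a minimal sketch of the three cases; the names row, col, ok, failed, and fixed are chosen for this illustration only:

```python
import numpy as np

r = np.arange(12).reshape((4, 3))

row = np.arange(0, 12, 4)       # shape (3,)
col = np.arange(0, 12, 3)       # shape (4,)

# (4, 3) + (3,): trailing dimensions match, so row is added to every row of r.
ok = r + row

# (4, 3) + (4,): trailing dimensions are 3 and 4, which are incompatible.
try:
    r + col
    failed = False
except ValueError:
    failed = True

# (4, 3) + (4, 1): the added axis of length 1 is stretched across the columns.
fixed = r + col[:, np.newaxis]
```

Indexing with np.newaxis has the same effect here as s.reshape(-1, 1) in the session above.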


Often, custom-defined Python functions work with ndarray objects as well. If the 
implementation allows, arrays can be used with functions just as int or float objects 
can. Consider the following function: 


In [146]: def f(x):
              return 3 * x + 5  (1)

In [147]: f(0.5)  (2)
Out[147]: 6.5

In [148]: f(r)  (3)
Out[148]: array([[ 5,  8, 11],
                 [14, 17, 20],
                 [23, 26, 29],
                 [32, 35, 38]])

(1) A simple Python function implementing a linear transform on parameter x.
(2) The function f() applied to a Python float object.
(3) The same function applied to an ndarray object, resulting in a vectorized and
    element-wise evaluation of the function.


Strictly speaking, NumPy does not call the function f once per element. Rather, each 
arithmetic operation inside f (the multiplication and the addition) is evaluated 
element-wise on the whole ndarray object. In that sense, by using this kind of 
operation one does not avoid loops; one only avoids them on the Python level and 
delegates the looping to NumPy. On the NumPy level, looping over the ndarray object 
is taken care of by optimized code, most of it written in C and therefore generally 
faster than pure Python. This explains the “secret” behind the performance benefits 
of using NumPy for array-based use cases. 
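The qualification “if the implementation allows” matters in practice. A function that branches on its argument with a plain if statement works for scalars but not for arrays. The sketch below uses two hypothetical functions, g() and g_vec(), that are not part of the original examples, to show one common way around this with np.where():

```python
import numpy as np

def g(x):
    # Scalar-only: "x > 0" yields an array for array input, and the
    # truth value of an array with several elements is ambiguous.
    return 3 * x + 5 if x > 0 else 0.0

def g_vec(x):
    # Array-friendly variant: np.where() selects element by element.
    x = np.asarray(x)
    return np.where(x > 0, 3 * x + 5, 0.0)

r = np.arange(-2, 4)   # array([-2, -1,  0,  1,  2,  3])

try:
    g(r)
    scalar_only = False
except ValueError:
    scalar_only = True

result = g_vec(r)
```

Note that np.where() evaluates both branches for all elements and then picks the values; this is usually acceptable for cheap expressions like the one here.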


Memory Layout 


When ndarray objects are initialized by using np.zeros(), as in “Multiple Dimen- 
sions” on page 94, an optional argument for the memory layout is provided. This 
argument specifies, roughly speaking, which elements of an array get stored in mem- 
ory next to each other (contiguously). When working with small arrays, this has 
hardly any measurable impact on the performance of array operations. However, 
when arrays get large, and depending on the (financial) algorithm to be implemented 
on them, the story might be different. This is when memory layout comes into play 
(see, for instance, Eli Bendersky’s article “Memory Layout of Multi-Dimensional 
Arrays”). 


To illustrate the potential importance of the memory layout of arrays in science and 
finance, consider the following construction of multidimensional ndarray objects: 


In [149]: x = np.random.standard_normal((1000000, 5))  (1)

In [150]: y = 2 * x + 3  (2)

In [151]: C = np.array((x, y), order='C')  (3)

In [152]: F = np.array((x, y), order='F')  (4)

In [153]: x = 0.0; y = 0.0  (5)

In [154]: C[:2].round(2)  (6)
Out[154]: array([[[-1.75,  0.34,  1.15, -0.25,  0.98],
                  [ 0.51,  0.22, -1.07, -0.19,  0.26],
                  ...

(1) An ndarray object with large asymmetry in the two dimensions.
(2) A linear transform of the original object data.
(3) This creates a three-dimensional ndarray object of shape (2, 1000000, 5) with
    C order (row-major).
(4) This creates a three-dimensional ndarray object of the same shape with F order
    (column-major).
(5) Memory is freed up (contingent on garbage collection).
(6) Some numbers from the C object.


Let’s look at some fundamental examples and use cases for both types of ndarray 
objects and consider the speed with which they are executed given the different 
memory layouts:

In [155]: %timeit C.sum()  (1)
4.36 ms ± 89.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [156]: %timeit F.sum()  (1)
4.21 ms ± 71.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [157]: %timeit C.sum(axis=0)  (2)
17.9 ms ± 776 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [158]: %timeit C.sum(axis=1)  (3)
35.1 ms ± 999 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [159]: %timeit F.sum(axis=0)  (2)
83.8 ms ± 2.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [160]: %timeit F.sum(axis=1)  (3)
67.9 ms ± 5.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

(1) Calculates the sum of all elements.
(2) Calculates the sums per row (“many”).
(3) Calculates the sums per columns (“few”).
We can summarize the performance results as follows:

• When calculating the sum of all elements, the memory layout does not really 
  matter.

• The summing up over the C-ordered ndarray objects is faster both over rows and 
  over columns (an absolute speed advantage).

• With the C-ordered (row-major) ndarray object, summing up over rows is 
  relatively faster compared to summing up over columns.

• With the F-ordered (column-major) ndarray object, summing up over columns 
  is relatively faster compared to summing up over rows.
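Whether a given ndarray object is stored in C or F order can be checked via its flags and strides attributes. The following is a small sketch on a deliberately tiny array (the 2 x 3 example array is an assumption made for readability; the timings above were obtained on much larger data):

```python
import numpy as np

x = np.arange(6).reshape((2, 3))
C = np.array(x, order='C')   # row-major copy
F = np.array(x, order='F')   # column-major copy

# Same values, different arrangement in memory.
same_values = bool((C == F).all())

c_flags = (C.flags['C_CONTIGUOUS'], C.flags['F_CONTIGUOUS'])
f_flags = (F.flags['C_CONTIGUOUS'], F.flags['F_CONTIGUOUS'])

# Strides: the number of bytes to step to reach the next element per axis.
itemsize = C.itemsize
c_strides = C.strides   # (3 * itemsize, itemsize): a row lies contiguously
f_strides = F.strides   # (itemsize, 2 * itemsize): a column lies contiguously
```

Operations that walk along the axis with the smallest stride read neighboring memory locations, which is one plausible reason for the relative speed differences observed above.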


Conclusion 


NumPy is the package of choice for numerical computing in Python. The ndarray class 
is specifically designed to be convenient and efficient in the handling of (large) 
numerical data. Powerful methods and NumPy universal functions allow for vectorized 
code that mostly avoids slow loops on the Python level. Many approaches introduced 
in this chapter carry over to pandas and its DataFrame class as well (see Chapter 5). 


Further Resources 


Many helpful resources are provided at the NumPy website:

• http://www.numpy.org/

Good introductions to NumPy in book form are:

• McKinney, Wes (2017). Python for Data Analysis. Sebastopol, CA: O’Reilly.

• VanderPlas, Jake (2016). Python Data Science Handbook. Sebastopol, CA: O’Reilly.
CHAPTER 5 
Data Analysis with pandas 


Data! Data! Data! I can’t make bricks without clay! 
—Sherlock Holmes 


This chapter is about pandas, a library for data analysis with a focus on tabular data. 
pandas is a powerful tool that not only provides many useful classes and functions 
but also does a great job of wrapping functionality from other packages. The result is 
a user interface that makes data analysis, and in particular financial analysis, a conve- 
nient and efficient task. 


This chapter covers the following fundamental data structures:

Object type   Meaning                               Used for
DataFrame     2-dimensional data object with index  Tabular data organized in columns
Series        1-dimensional data object with index  Single (time) series of data


The chapter is organized as follows: 


“The DataFrame Class” on page 114 
This section starts by exploring the basic characteristics and capabilities of the 
DataFrame class of pandas by using simple and small data sets; it then shows how 
to transform a NumPy ndarray object into a DataFrame object. 


“Basic Analytics” on page 123 and “Basic Visualization” on page 126 
Basic analytics and visualization capabilities are introduced in these sections 
(later chapters go deeper into these topics). 


“The Series Class” on page 128 
This rather brief section covers the Series class of pandas, which in a sense rep- 
resents a special case of the DataFrame class with a single column of data only. 
“GroupBy Operations” on page 130 
One of the strengths of the DataFrame class lies in grouping data according to a 
single or multiple columns. This section explores the grouping capabilities of 
pandas. 


“Complex Selection” on page 132 
This section illustrates how the use of (complex) conditions allows for the easy 
selection of data from a DataFrame object. 


“Concatenation, Joining, and Merging” on page 135 
The combining of different data sets into one is an important operation in data 
analysis. pandas provides different options to accomplish this task, as described 
in this section. 


“Performance Aspects” on page 141 
Like Python in general, pandas often provides multiple options to accomplish the 
same goal. This section takes a brief look at potential performance differences. 


The DataFrame Class 


At the core of pandas (and this chapter) is the DataFrame, a class designed to effi- 
ciently handle data in tabular form—i.e., data characterized by a columnar organiza- 
tion. To this end, the DataFrame class provides, for instance, column labeling as well 
as flexible indexing capabilities for the rows (records) of the data set, similar to a table 
in a relational database or an Excel spreadsheet. 


This section covers some fundamental aspects of the pandas DataFrame class. The 
class is so complex and powerful that only a fraction of its capabilities can be presen- 
ted here. Subsequent chapters provide more examples and shed light on different 
aspects. 


First Steps with the DataFrame Class 


On a fundamental level, the DataFrame class is designed to manage indexed and 
labeled data, not too different from a SQL database table or a worksheet in a spread- 
sheet application. Consider the following creation of a DataFrame object: 


In [1]: import pandas as pd  (1)

In [2]: df = pd.DataFrame([10, 20, 30, 40],  (2)
                          columns=['numbers'],  (3)
                          index=['a', 'b', 'c', 'd'])  (4)

In [3]: df  (5)
Out[3]:    numbers
        a       10
        b       20
        c       30
        d       40

(1) Imports pandas.
(2) Defines the data as a list object.
(3) Specifies the column label.
(4) Specifies the index values/labels.
(5) Shows the data as well as column and index labels of the DataFrame object.


This simple example already shows some major features of the DataFrame class when 
it comes to storing data: 


• Data itself can be provided in different shapes and types (list, tuple, ndarray, 
  and dict objects are candidates).

• Data is organized in columns, which can have custom names (labels).

• There is an index that can take on different formats (e.g., numbers, strings, time 
  information).
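The points above can be sketched with a few alternative constructions of the same data. The variable names (from_list, from_array, from_dict, dated) are assumptions for this illustration:

```python
import numpy as np
import pandas as pd

index = ['a', 'b', 'c', 'd']

# The same column of data provided as list, ndarray, and dict objects.
from_list = pd.DataFrame([10, 20, 30, 40], columns=['numbers'], index=index)
from_array = pd.DataFrame(np.array([10, 20, 30, 40]),
                          columns=['numbers'], index=index)
from_dict = pd.DataFrame({'numbers': [10, 20, 30, 40]}, index=index)

# The index does not have to consist of strings; here it holds timestamps.
dated = pd.DataFrame({'numbers': [10, 20, 30, 40]},
                     index=pd.date_range('2019-01-01', periods=4, freq='D'))
```

With a dict object, the keys become the column labels, so the columns parameter is not needed.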


Working with a DataFrame object is in general pretty convenient and efficient with 
regard to the handling of the object, e.g., compared to regular ndarray objects, which 
are more specialized and more restricted when one wants to (say) enlarge an existing 
object. At the same time, DataFrame objects are often as computationally efficient as 
ndarray objects. The following are simple examples showing how typical operations 
on a DataFrame object work: 


In [4]: df.index  (1)
Out[4]: Index(['a', 'b', 'c', 'd'], dtype='object')

In [5]: df.columns  (2)
Out[5]: Index(['numbers'], dtype='object')

In [6]: df.loc['c']  (3)
Out[6]: numbers    30
        Name: c, dtype: int64

In [7]: df.loc[['a', 'd']]  (4)
Out[7]:    numbers
        a       10
        d       40

In [8]: df.iloc[1:3]  (5)
Out[8]:    numbers
        b       20
        c       30

In [9]: df.sum()  (6)
Out[9]: numbers    100
        dtype: int64

In [10]: df.apply(lambda x: x ** 2)  (7)
Out[10]:    numbers
         a      100
         b      400
         c      900
         d     1600

In [11]: df ** 2  (8)
Out[11]:    numbers
         a      100
         b      400
         c      900
         d     1600

(1) The index attribute and Index object.
(2) The columns attribute and Index object.
(3) Selects the value corresponding to index c.
(4) Selects the two values corresponding to indices a and d.
(5) Selects the second and third rows via the index positions.
(6) Calculates the sum of the single column.
(7) Uses the apply() method to calculate squares in vectorized fashion.
(8) Applies vectorization directly as with ndarray objects.


Contrary to NumPy ndarray objects, enlarging the DataFrame object in both dimen- 
sions is possible: 


In [12]: df['floats'] = (1.5, 2.5, 3.5, 4.5)  (1)

In [13]: df
Out[13]:    numbers  floats
         a       10     1.5
         b       20     2.5
         c       30     3.5
         d       40     4.5

In [14]: df['floats']  (2)
Out[14]: a    1.5
         b    2.5
         c    3.5
         d    4.5
         Name: floats, dtype: float64

(1) Adds a new column with float objects provided as a tuple object.
(2) Selects this column and shows its data and index labels.


A whole DataFrame object can also be taken to define a new column. In such a case, 
indices are aligned automatically: 


In [15]: df['names'] = pd.DataFrame(['Yves', 'Sandra', 'Lilli', 'Henry'],
                                    index=['d', 'a', 'b', 'c'])  (1)

In [16]: df
Out[16]:    numbers  floats   names
         a       10     1.5  Sandra
         b       20     2.5   Lilli
         c       30     3.5   Henry
         d       40     4.5    Yves

(1) Another new column is created based on a DataFrame object.
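The automatic alignment also covers the case where the new data is incomplete. The sketch below uses a Series object instead of a DataFrame object and deliberately omits one label; this variation is an assumption made for illustration and is not part of the session above:

```python
import pandas as pd

df = pd.DataFrame({'numbers': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

# The new data arrives in a different order and lacks a value for 'c'.
names = pd.Series(['Yves', 'Sandra', 'Lilli'], index=['d', 'a', 'b'])

# Assignment aligns on the index labels, not on the position.
df['names'] = names
```

The row labeled c receives a NaN value because no matching label exists in the new data.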


Appending data works similarly. However, in the following example a side effect is 
seen that is usually to be avoided—namely, the index gets replaced by a simple range 
index: 


In [17]: df.append({'numbers': 100, 'floats': 5.75, 'names': 'Jil'},
                   ignore_index=True)  (1)
Out[17]:    numbers  floats   names
         0       10    1.50  Sandra
         1       20    2.50   Lilli
         2       30    3.50   Henry
         3       40    4.50    Yves
         4      100    5.75     Jil

In [18]: df = df.append(pd.DataFrame({'numbers': 100, 'floats': 5.75,
                                      'names': 'Jil'}, index=['y',]))  (2)

In [19]: df
Out[19]:    numbers  floats   names
         a       10    1.50  Sandra
         b       20    2.50   Lilli
         c       30    3.50   Henry
         d       40    4.50    Yves
         y      100    5.75     Jil

In [20]: df = df.append(pd.DataFrame({'names': 'Liz'}, index=['z',]),
                        sort=False)  (3)

In [21]: df
Out[21]:    numbers  floats   names
         a     10.0    1.50  Sandra
         b     20.0    2.50   Lilli
         c     30.0    3.50   Henry
         d     40.0    4.50    Yves
         y    100.0    5.75     Jil
         z      NaN     NaN     Liz

In [22]: df.dtypes  (4)
Out[22]: numbers    float64
         floats     float64
         names       object
         dtype: object

(1) Appends a new row via a dict object; this is a temporary operation during which
    index information gets lost.
(2) Appends the row based on a DataFrame object with index information; the
    original index information is preserved.
(3) Appends an incomplete data row to the DataFrame object, resulting in NaN
    values.
(4) Returns the different dtypes of the single columns; this is similar to what’s
    possible with structured ndarray objects.
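Readers on a recent pandas installation should be aware that the DataFrame.append() method was deprecated in pandas 1.4 and removed in pandas 2.0. A comparable result can be obtained with pd.concat(). The following is a sketch on a shortened version of the data, not a drop-in replacement for the session above:

```python
import pandas as pd

df = pd.DataFrame({'numbers': [10, 20], 'floats': [1.5, 2.5],
                   'names': ['Sandra', 'Lilli']}, index=['a', 'b'])

# A complete new row and an incomplete one, each with its own index label.
new_row = pd.DataFrame({'numbers': 100, 'floats': 5.75, 'names': 'Jil'},
                       index=['y'])
partial_row = pd.DataFrame({'names': 'Liz'}, index=['z'])

# pd.concat() stacks the objects and keeps the index labels of all inputs;
# columns missing in an input are filled with NaN values.
df = pd.concat([df, new_row, partial_row], sort=False)
```

As in the original example, the numbers column is converted to float64 because an integer column cannot hold NaN values.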


Although there are now missing values, the majority of method calls will still work: 


In [23]: df[['numbers', 'floats']].mean()  (1)
Out[23]: numbers    40.00
         floats      3.55
         dtype: float64

In [24]: df[['numbers', 'floats']].std()  (2)
Out[24]: numbers    35.355339
         floats      1.662077
         dtype: float64

(1) Calculates the mean over the two columns specified (ignoring rows with NaN
    values).
(2) Calculates the standard deviation over the two columns specified (ignoring rows
    with NaN values).




Second Steps with the DataFrame Class 


The example in this subsection is based on an ndarray object with standard normally 
distributed random numbers. It explores further features such as a DatetimeIndex to 
manage time series data:

In [25]: import numpy as np

In [26]: np.random.seed(100)

In [27]: a = np.random.standard_normal((9, 4))

In [28]: a
Out[28]: array([[-1.74976547,  0.3426804 ,  1.1530358 , -0.25243604],
                [ 0.98132079,  0.51421884,  0.22117967, -1.07004333],
                [-0.18949583,  0.25500144, -0.45802699,  0.43516349],
                [-0.58359505,  0.81684707,  0.67272081, -0.10441114],
                [-0.53128038,  1.02973269, -0.43813562, -1.11831825],
                [ 1.61898166,  1.54160517, -0.25187914, -0.84243574],
                [ 0.18451869,  0.9370822 ,  0.73100034,  1.36155613],
                [-0.32623806,  0.05567601,  0.22239961, -1.443217  ],
                [-0.75635231,  0.81645401,  0.75044476, -0.45594693]])


Although one can construct DataFrame objects more directly (as seen before), using 
an ndarray object is generally a good choice since pandas will retain the basic struc- 
ture and will “only” add metainformation (e.g., index values). It also represents a typ- 
ical use case for financial applications and scientific research in general. For example: 


In [29]: df = pd.DataFrame(a)  (1)

In [30]: df
Out[30]:           0         1         2         3
         0 -1.749765  0.342680  1.153036 -0.252436
         1  0.981321  0.514219  0.221180 -1.070043
         2 -0.189496  0.255001 -0.458027  0.435163
         3 -0.583595  0.816847  0.672721 -0.104411
         4 -0.531280  1.029733 -0.438136 -1.118318
         5  1.618982  1.541605 -0.251879 -0.842436
         6  0.184519  0.937082  0.731000  1.361556
         7 -0.326238  0.055676  0.222400 -1.443217
         8 -0.756352  0.816454  0.750445 -0.455947

(1) Creates a DataFrame object from the ndarray object.


Table 5-1 lists the parameters that the DataFrame() function takes. In the table, 
“array-like” means a data structure similar to an ndarray object—a list, for exam- 
ple. Index is an instance of the pandas Index class. 




Table 5-1. Parameters of DataFrame() function

Parameter  Format                  Description
data       ndarray/dict/DataFrame  Data for DataFrame; dict can contain Series, ndarray, list
index      Index/array-like        Index to use; defaults to range(n)
columns    Index/array-like        Column headers to use; defaults to range(n)
dtype      dtype, default None     Data type to use/force; otherwise, it is inferred
copy       bool, default None      Copy data from inputs


As with structured arrays, and as seen before, DataFrame objects have column names 
that can be defined directly by assigning a list object with the right number of 
elements. This illustrates that one can define/change the attributes of the DataFrame 
object easily: 


In [31]: df.columns = ['No1', 'No2', 'No3', 'No4']  (1)

In [32]: df
Out[32]:         No1       No2       No3       No4
         0 -1.749765  0.342680  1.153036 -0.252436
         1  0.981321  0.514219  0.221180 -1.070043
         2 -0.189496  0.255001 -0.458027  0.435163
         3 -0.583595  0.816847  0.672721 -0.104411
         4 -0.531280  1.029733 -0.438136 -1.118318
         5  1.618982  1.541605 -0.251879 -0.842436
         6  0.184519  0.937082  0.731000  1.361556
         7 -0.326238  0.055676  0.222400 -1.443217
         8 -0.756352  0.816454  0.750445 -0.455947

In [33]: df['No2'].mean()  (2)
Out[33]: 0.7010330941456459

(1) Specifies the column labels via a list object.
(2) Picking a column is now made easy.


To work with financial time series data efficiently, one must be able to handle time 
indices well. This can also be considered a major strength of pandas. For example, 
assume that our nine data entries in the four columns correspond to month-end data, 
beginning in January 2019. A DatetimeIndex object is then generated with the 
date_range() function as follows: 


In [34]: dates = pd.date_range('2019-1-1', periods=9, freq='M') (1) 


In [35]: dates 
Out[35]: DatetimeIndex(['2019-01-31', '2019-02-28', '2019-03-31', '2019-04-30', 
'2019-05-31', '2019-06-30', '2019-07-31', '2019-08-31', 
'2019-09-30'], 
dtype='datetime64[ns]', freq='M') 
(1) Creates a DatetimeIndex object. 


Table 5-2 lists the parameters that the date_range() function takes. 


Table 5-2. Parameters of date_range() function

Parameter  Format               Description
start      string/datetime      Left bound for generating dates
end        string/datetime      Right bound for generating dates
periods    integer/None         Number of periods (if start or end is None)
freq       string/DateOffset    Frequency string, e.g., 5D for 5 days
tz         string/None          Time zone name for localized index
normalize  bool, default False  Normalizes start and end to midnight
name       string, default None Name of resulting index


The following code defines the just-created DatetimeIndex object as the relevant 
index object, making a time series of the original data set: 


In [36]: df.index = dates

In [37]: df
Out[37]:                  No1       No2       No3       No4
         2019-01-31 -1.749765  0.342680  1.153036 -0.252436
         2019-02-28  0.981321  0.514219  0.221180 -1.070043
         2019-03-31 -0.189496  0.255001 -0.458027  0.435163
         2019-04-30 -0.583595  0.816847  0.672721 -0.104411
         2019-05-31 -0.531280  1.029733 -0.438136 -1.118318
         2019-06-30  1.618982  1.541605 -0.251879 -0.842436
         2019-07-31  0.184519  0.937082  0.731000  1.361556
         2019-08-31 -0.326238  0.055676  0.222400 -1.443217
         2019-09-30 -0.756352  0.816454  0.750445 -0.455947


When it comes to the generation of DatetimeIndex objects with the help of the 
date_range() function, there are a number of choices for the frequency parameter 
freq. Table 5-3 lists all the options. 


Table 5-3. Frequency parameter values for date_range() function

Alias  Description
B      Business day frequency
C      Custom business day frequency (experimental)
D      Calendar day frequency
W      Weekly frequency
M      Month end frequency
BM     Business month end frequency
MS     Month start frequency
BMS    Business month start frequency
Q      Quarter end frequency
BQ     Business quarter end frequency
QS     Quarter start frequency
BQS    Business quarter start frequency
A      Year end frequency
BA     Business year end frequency
AS     Year start frequency
BAS    Business year start frequency
H      Hourly frequency
T      Minutely frequency
S      Secondly frequency
L      Milliseconds
U      Microseconds
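The aliases in Table 5-3 reflect the pandas version used for this book. Newer pandas releases (2.2 and later) have renamed several of them (for example, M became ME), so the alias strings that work may depend on the installed version. The following sketch therefore passes an offset object for the month-end case, which should behave the same across versions; the variable names are assumptions for this illustration:

```python
import pandas as pd

# Month-end dates via an explicit offset object instead of an alias string.
month_ends = pd.date_range('2019-1-1', periods=3, freq=pd.offsets.MonthEnd())

# Business days and month starts via alias strings.
business_days = pd.date_range('2019-1-1', periods=3, freq='B')
month_starts = pd.date_range('2019-1-1', periods=3, freq='MS')

# Plain string representations for inspection.
month_end_labels = [d.strftime('%Y-%m-%d') for d in month_ends]
business_day_labels = [d.strftime('%Y-%m-%d') for d in business_days]
month_start_labels = [d.strftime('%Y-%m-%d') for d in month_starts]
```

January 1, 2019 was a Tuesday, so the business day range starts on that date.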


In some circumstances, it pays off to have access to the original data set in the form of 
the ndarray object. The values attribute provides direct access to it: 


In [38]: df.values
Out[38]: array([[-1.74976547,  0.3426804 ,  1.1530358 , -0.25243604],
                [ 0.98132079,  0.51421884,  0.22117967, -1.07004333],
                [-0.18949583,  0.25500144, -0.45802699,  0.43516349],
                [-0.58359505,  0.81684707,  0.67272081, -0.10441114],
                [-0.53128038,  1.02973269, -0.43813562, -1.11831825],
                [ 1.61898166,  1.54160517, -0.25187914, -0.84243574],
                [ 0.18451869,  0.9370822 ,  0.73100034,  1.36155613],
                [-0.32623806,  0.05567601,  0.22239961, -1.443217  ],
                [-0.75635231,  0.81645401,  0.75044476, -0.45594693]])

In [39]: np.array(df)
Out[39]: array([[-1.74976547,  0.3426804 ,  1.1530358 , -0.25243604],
                [ 0.98132079,  0.51421884,  0.22117967, -1.07004333],
                [-0.18949583,  0.25500144, -0.45802699,  0.43516349],
                [-0.58359505,  0.81684707,  0.67272081, -0.10441114],
                [-0.53128038,  1.02973269, -0.43813562, -1.11831825],
                [ 1.61898166,  1.54160517, -0.25187914, -0.84243574],
                [ 0.18451869,  0.9370822 ,  0.73100034,  1.36155613],
                [-0.32623806,  0.05567601,  0.22239961, -1.443217  ],
                [-0.75635231,  0.81645401,  0.75044476, -0.45594693]])
Arrays and DataFrames 


One can generate a DataFrame object from an ndarray object, but 
one can also easily generate an ndarray object out of a DataFrame 
by using the values attribute of the DataFrame class or the function 
np.array() of NumPy. 
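More recent pandas versions (0.24 and later) additionally offer the to_numpy() method, which the pandas documentation recommends over the values attribute. A minimal sketch on a small, made-up DataFrame object shows that all three approaches give the same array:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'No1': [1.0, 2.0], 'No2': [3.0, 4.0]})

via_values = df.values        # attribute access, as used in the text
via_array = np.array(df)      # NumPy conversion function
via_to_numpy = df.to_numpy()  # method available in newer pandas versions

all_equal = bool((via_values == via_array).all()
                 and (via_array == via_to_numpy).all())
```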


Basic Analytics 


Like the NumPy ndarray class, the pandas DataFrame class has a multitude of conve- 
nience methods built in. As a starter, consider the methods info() and describe(): 


In [40]: df.info()  (1)
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 9 entries, 2019-01-31 to 2019-09-30
Freq: M
Data columns (total 4 columns):
No1    9 non-null float64
No2    9 non-null float64
No3    9 non-null float64
No4    9 non-null float64
dtypes: float64(4)
memory usage: 360.0 bytes

In [41]: df.describe()  (2)
Out[41]:             No1       No2       No3       No4
         count  9.000000  9.000000  9.000000  9.000000
         mean  -0.150212  0.701033  0.289193 -0.387788
         std    0.988306  0.457685  0.579920  0.877932
         min   -1.749765  0.055676 -0.458027 -1.443217
         25%   -0.583595  0.342680 -0.251879 -1.070043
         50%   -0.326238  0.816454  0.222400 -0.455947
         75%    0.184519  0.937082  0.731000 -0.104411
         max    1.618982  1.541605  1.153036  1.361556

(1) Provides metainformation regarding the data, columns, and index.
(2) Provides helpful summary statistics per column (for numerical data).


In addition, one can easily get the column-wise or row-wise sums, means, and 
cumulative sums:

In [43]: df.sum()  (1)
Out[43]: No1   -1.351906
         No2    6.309298
         No3    2.602739
         No4   -3.490089
         dtype: float64

In [44]: df.mean()  (2)
Out[44]: No1   -0.150212
         No2    0.701033
         No3    0.289193
         No4   -0.387788
         dtype: float64

In [45]: df.mean(axis=0)  (2)
Out[45]: No1   -0.150212
         No2    0.701033
         No3    0.289193
         No4   -0.387788
         dtype: float64

In [46]: df.mean(axis=1)  (3)
Out[46]: 2019-01-31   -0.126621
         2019-02-28    0.161669
         2019-03-31    0.010661
         2019-04-30    0.200390
         2019-05-31   -0.264500
         2019-06-30    0.516568
         2019-07-31    0.803539
         2019-08-31   -0.372845
         2019-09-30    0.088650
         Freq: M, dtype: float64

In [47]: df.cumsum()  (4)
Out[47]:                  No1       No2       No3       No4
         2019-01-31 -1.749765  0.342680  1.153036 -0.252436
         2019-02-28 -0.768445  0.856899  1.374215 -1.322479
         2019-03-31 -0.957941  1.111901  0.916188 -0.887316
         2019-04-30 -1.541536  1.928748  1.588909 -0.991727
         2019-05-31 -2.072816  2.958480  1.150774 -2.110045
         2019-06-30 -0.453834  4.500086  0.898895 -2.952481
         2019-07-31 -0.269316  5.437168  1.629895 -1.590925
         2019-08-31 -0.595554  5.492844  1.852294 -3.034142
         2019-09-30 -1.351906  6.309298  2.602739 -3.490089

(1) Column-wise sum.
(2) Column-wise mean.
(3) Row-wise mean.
(4) Column-wise cumulative sum (starting at first index position).

DataFrame objects also understand NumPy universal functions, as expected: 


In [48]: np.mean(df)  (1)
Out[48]: No1   -0.150212
         No2    0.701033
         No3    0.289193
         No4   -0.387788
         dtype: float64

In [49]: np.log(df)  (2)
Out[49]:                  No1       No2       No3       No4
         2019-01-31       NaN -1.070957  0.142398       NaN
         2019-02-28 -0.018856 -0.665106 -1.508780       NaN
         2019-03-31       NaN -1.366486       NaN -0.832033
         2019-04-30       NaN -0.202303 -0.396425       NaN
         2019-05-31       NaN  0.029299       NaN       NaN
         2019-06-30  0.481797  0.432824       NaN       NaN
         2019-07-31 -1.690005 -0.064984 -0.313341  0.308628
         2019-08-31       NaN -2.888206 -1.503279       NaN
         2019-09-30       NaN -0.202785 -0.287089       NaN

In [50]: np.sqrt(abs(df))  (3)
Out[50]:                  No1       No2       No3       No4
         2019-01-31  1.322787  0.585389  1.073795  0.502430
         2019-02-28  0.990616  0.717091  0.470297  1.034429
         2019-03-31  0.435311  0.504977  0.676777  0.659669
         2019-04-30  0.763934  0.903796  0.820196  0.323127
         2019-05-31  0.728890  1.014757  0.661918  1.057506
         2019-06-30  1.272392  1.241614  0.501876  0.917843
         2019-07-31  0.429556  0.968030  0.854986  1.166857
         2019-08-31  0.571173  0.235958  0.471593  1.201340
         2019-09-30  0.869685  0.903578  0.866282  0.675238

In [51]: np.sqrt(abs(df)).sum()  (4)
Out[51]: No1    7.384345
         No2    7.075190
         No3    6.397719
         No4    7.538440
         dtype: float64

In [52]: 100 * df + 100  (5)
Out[52]:                    No1         No2         No3         No4
         2019-01-31  -74.976547  134.268040  215.303580   74.756396
         2019-02-28  198.132079  151.421884  122.117967   -7.004333
         2019-03-31   81.050417  125.500144   54.197301  143.516349
         2019-04-30   41.640495  181.684707  167.272081   89.558886
         2019-05-31   46.871962  202.973269   56.186438  -11.831825
         2019-06-30  261.898166  254.160517   74.812086   15.756426
         2019-07-31  118.451869  193.708220  173.100034  236.155613
         2019-08-31   67.376194  105.567601  122.239961  -44.321700
         2019-09-30   24.364769  181.645401  175.044476   54.405307

(1) Column-wise mean.
(2) Element-wise natural logarithm; a warning is raised but the calculation runs
    through, resulting in multiple NaN values.
(3) Element-wise square root for the absolute values ...
(4) ... and column-wise sums for the results.
(5) A linear transform of the numerical data.


NumPy Universal Functions 


In general, one can apply NumPy universal functions to pandas 
DataFrame objects whenever they could be applied to an ndarray object 
containing the same type of data. 


pandas is quite error tolerant, in the sense that it captures errors and just puts a NaN 
value where the respective mathematical operation fails. Not only this, but as briefly 
shown before, one can also work with such incomplete data sets as if they were com- 
plete in a number of cases. This comes in handy, since reality is characterized by 
incomplete data sets more often than one might wish. 
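This tolerance can be sketched on a tiny example. The DataFrame object below and its column name x are assumptions for this illustration; the skipna parameter and the dropna() method are standard pandas features:

```python
import warnings
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, -1.0, np.e]})

# The logarithm of a negative number is undefined; the result is NaN.
# A RuntimeWarning may be raised, which is silenced here.
with warnings.catch_warnings():
    warnings.simplefilter('ignore', RuntimeWarning)
    logs = np.log(df)

mean_skip = logs['x'].mean()                # NaN values are skipped by default
mean_strict = logs['x'].mean(skipna=False)  # NaN values propagate on request
cleaned = logs.dropna()                     # rows with NaN values removed
```

Whether silently skipping missing values is appropriate depends on the analysis at hand; skipna=False makes gaps in the data visible.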


Basic Visualization 


Plotting of data is only one line of code away in general, once the data is stored in a 
DataFrame object (see Figure 5-1): 

In [53]: from pylab import plt, mpl  (1)
         plt.style.use('seaborn')  (1)
         mpl.rcParams['font.family'] = 'serif'  (1)
         %matplotlib inline

In [54]: df.cumsum().plot(lw=2.0, figsize=(10, 6));  (2)

(1) Customizing the plotting style.
(2) Plotting the cumulative sums of the four columns as a line plot.


Basically, pandas provides a wrapper around matplotlib (see Chapter 7), 
specifically designed for DataFrame objects. Table 5-4 lists the parameters that 
the plot() method takes. 

Figure 5-1. Line plot of a DataFrame object

Table 5-4. Parameters of plot() method

Parameter     Format                                                    Description
x             label/position, default None                              Only used when column values are x-ticks
y             label/position, default None                              Only used when column values are y-ticks
subplots      boolean, default False                                    Plot columns in subplots
sharex        boolean, default True                                     Share the x-axis
sharey        boolean, default False                                    Share the y-axis
use_index     boolean, default True                                     Use DataFrame.index as x-ticks
stacked       boolean, default False                                    Stack (only for bar plots)
sort_columns  boolean, default False                                    Sort columns alphabetically before plotting
title         string, default None                                      Title for the plot
grid          boolean, default False                                    Show horizontal and vertical grid lines
legend        boolean, default True                                     Show legend of labels
ax            matplotlib axis object                                    matplotlib axis object to use for plotting
style         string or list/dictionary                                 Line plotting style (for each column)
kind          string (e.g., "line", "bar", "barh", "kde", "density")    Type of plot
logx          boolean, default False                                    Use logarithmic scaling of x-axis
logy          boolean, default False                                    Use logarithmic scaling of y-axis
xticks        sequence, default Index                                   X-ticks for the plot
yticks        sequence, default Values                                  Y-ticks for the plot
xlim          2-tuple, list                                             Boundaries for x-axis
ylim          2-tuple, list                                             Boundaries for y-axis
rot           integer, default None                                     Rotation of x-ticks
secondary_y   boolean/sequence, default False                           Plot on secondary y-axis
mark_right    boolean, default True                                     Automatic labeling of secondary axis
colormap      string/colormap object, default None                      Color map to use for plotting
kwds          keywords                                                  Options to pass to matplotlib
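To make a few of the parameters from Table 5-4 concrete, the following sketch plots every column in its own subplot and then draws a stacked bar chart. The data is randomly generated for illustration only (it is not the df object used above), the name demo is hypothetical, and the non-interactive Agg backend is an assumption made so that the code runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless backend, no display required
import numpy as np
import pandas as pd

# Illustrative data only: 9 daily rows, 4 columns of random numbers.
rng = np.random.default_rng(100)
demo = pd.DataFrame(rng.standard_normal((9, 4)),
                    columns=['No1', 'No2', 'No3', 'No4'],
                    index=pd.date_range('2019-1-1', periods=9, freq='D'))

# subplots=True draws one Axes object per column; sharex, grid, and
# title are further parameters listed in Table 5-4.
axes = demo.cumsum().plot(subplots=True, sharex=True, grid=True,
                          figsize=(10, 6), title='Cumulative sums')

# kind='bar' together with stacked=True stacks the column values per row.
ax_bar = demo.plot(kind='bar', stacked=True, rot=15, figsize=(10, 6))
```

With subplots=True the method returns a collection of Axes objects, one per column, instead of a single Axes object.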


As another example, consider a bar plot of the same data (see Figure 5-2): 


In [55]: df.plot.bar(figsize=(10, 6), rot=15);  (1)
         # df.plot(kind='bar', figsize=(10, 6))  (2)

(1) Plots the bar chart via .plot.bar().

(2) Alternative syntax: uses the kind parameter to change the plot type.


Figure 5-2. Bar plot of a DataFrame object


The Series Class 


So far this chapter has worked mainly with the pandas DataFrame class. Series is
another important class that comes with pandas. It is characterized by the fact that it
has only a single column of data. In that sense, it is a specialization of the DataFrame
class that shares many but not all of its characteristics and capabilities. A Series
object is obtained when a single column is selected from a multicolumn DataFrame
object:


In [56]: type(df)
Out[56]: pandas.core.frame.DataFrame

In [57]: S = pd.Series(np.linspace(0, 15, 7), name='series')

In [58]: S
Out[58]: 0     0.0
         1     2.5
         2     5.0
         3     7.5
         4    10.0
         5    12.5
         6    15.0
         Name: series, dtype: float64

In [59]: type(S)
Out[59]: pandas.core.series.Series

In [60]: s = df['No1']

In [61]: s
Out[61]: 2019-01-31   -1.749765
         2019-02-28    0.981321
         2019-03-31   -0.189496
         2019-04-30   -0.583595
         2019-05-31   -0.531280
         2019-06-30    1.618982
         2019-07-31    0.184519
         2019-08-31   -0.326238
         2019-09-30   -0.756352
         Freq: M, Name: No1, dtype: float64

In [62]: type(s)
Out[62]: pandas.core.series.Series


The main DataFrame methods are available for Series objects as well. For illustration,
consider the mean() and plot() methods (see Figure 5-3):


In [63]: s.mean()
Out[63]: -0.15021177307319458

In [64]: s.plot(lw=2.0, figsize=(10, 6));

Figure 5-3. Line plot of a Series object


GroupBy Operations 


pandas has powerful and flexible grouping capabilities. They work similarly to grouping
in SQL as well as pivot tables in Microsoft Excel. To have something to group by,
one can add, for instance, a column indicating the quarter to which the respective
index date belongs:

In [65]: df['Quarter'] = ['Q1', 'Q1', 'Q1', 'Q2', 'Q2',
                          'Q2', 'Q3', 'Q3', 'Q3']
         df


Out[65]: No1 No2 No3 No4 Quarter 
2019-01-31 -1.749765 0.342680 1.153036 -0.252436 Q1 
2019-02-28 0.981321 0.514219 0.221180 -1.070043 Q1 
2019-03-31 -0.189496 0.255001 -0.458027 0.435163 Q1 
2019-04-30 -0.583595 0.816847 0.672721 -0.104411 Q2 
2019-05-31 -0.531280 1.029733 -0.438136 -1.118318 Q2 
2019-06-30 1.618982 1.541605 -0.251879 -0.842436 Q2 
2019-07-31 0.184519 0.937082 0.731000 1.361556 Q3 
2019-08-31 -0.326238 0.055676 0.222400 -1.443217 Q3 
2019-09-30 -0.756352 0.816454 0.750445 -0.455947 Q3 


The following code groups by the Quarter column and outputs statistics for the sin- 
gle groups: 


In [66]: groups = df.groupby('Quarter')  (1)

In [67]: groups.size()  (2)
Out[67]: Quarter
         Q1    3
         Q2    3
         Q3    3
         dtype: int64

In [68]: groups.mean()  (3)
Out[68]:               No1       No2       No3       No4
         Quarter
         Q1      -0.319314  0.370634  0.305396 -0.295772
         Q2       0.168035  1.129395 -0.005765 -0.688388
         Q3      -0.299357  0.603071  0.567948 -0.179203

In [69]: groups.max()  (4)
Out[69]:               No1       No2       No3       No4
         Quarter
         Q1       0.981321  0.514219  1.153036  0.435163
         Q2       1.618982  1.541605  0.672721 -0.104411
         Q3       0.184519  0.937082  0.750445  1.361556

In [70]: groups.aggregate([min, max]).round(2)  (5)
Out[70]:           No1         No2         No3         No4
                   min   max   min   max   min   max   min   max
         Quarter
         Q1      -1.75  0.98  0.26  0.51 -0.46  1.15 -1.07  0.44
         Q2      -0.58  1.62  0.82  1.54 -0.44  0.67 -1.12 -0.10
         Q3      -0.76  0.18  0.06  0.94  0.22  0.75 -1.44  1.36

(1) Groups according to the Quarter column.

(2) Gives the number of rows in each group.

(3) Gives the mean per column.

(4) Gives the maximum value per column.

(5) Gives both the minimum and maximum values per column.
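As a supplementary sketch, aggregation functions can also be referenced by name as strings, and a dict can assign a different function to each column. The data set below is made up for illustration (it is not the df object from above), and the names demo, stats, and mixed are hypothetical.

```python
import pandas as pd

# Small illustrative data set with made-up values.
demo = pd.DataFrame({'Quarter': ['Q1', 'Q1', 'Q2', 'Q2'],
                     'No1': [1.0, 3.0, -2.0, 4.0],
                     'No2': [0.5, 1.5, 2.0, 6.0]})

groups = demo.groupby('Quarter')

# Aggregation functions referenced by name instead of as function objects.
stats = groups.aggregate(['min', 'max'])

# A dict maps column names to the function to apply to that column.
mixed = groups.aggregate({'No1': 'mean', 'No2': 'sum'})

print(stats)
print(mixed)
```

Passing a list of functions yields hierarchical column labels (column name, function name), while the dict form keeps one result column per input column.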


Grouping can also be done with multiple columns. To this end, another column,
indicating whether the month of the index date is odd or even, is introduced:


In [71]: df['Odd_Even'] = ['Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even',
                           'Odd', 'Even', 'Odd']

In [72]: groups = df.groupby(['Quarter', 'Odd_Even'])

In [73]: groups.size()
Out[73]: Quarter  Odd_Even
         Q1       Even        1
                  Odd         2
         Q2       Even        2
                  Odd         1
         Q3       Even        1
                  Odd         2
         dtype: int64

In [74]: groups[['No1', 'No4']].aggregate([sum, np.mean])
Out[74]:                        No1                 No4
                                sum      mean       sum      mean
         Quarter Odd_Even
         Q1      Even      0.981321  0.981321 -1.070043 -1.070043
                 Odd      -1.939261 -0.969631  0.182727  0.091364
         Q2      Even      1.035387  0.517693 -0.946847 -0.473423
                 Odd      -0.531280 -0.531280 -1.118318 -1.118318
         Q3      Even     -0.326238 -0.326238 -1.443217 -1.443217
                 Odd      -0.571834 -0.285917  0.905609  0.452805


Complex Selection


Often, data selection is accomplished by formulation of conditions on column values,
and potentially combining multiple such conditions logically. Consider the following
data set:


In [75]: data = np.random.standard_normal((10, 2))  (1)

In [76]: df = pd.DataFrame(data, columns=['x', 'y'])  (2)

In [77]: df.info()  (2)
         <class 'pandas.core.frame.DataFrame'>
         RangeIndex: 10 entries, 0 to 9
         Data columns (total 2 columns):
         x    10 non-null float64
         y    10 non-null float64
         dtypes: float64(2)
         memory usage: 240.0 bytes

In [78]: df.head()  (3)
Out[78]:           x         y
         0  1.189622 -1.690617
         1 -1.356399 -1.232435
         2 -0.544439 -0.668172
         3  0.007315 -0.612939
         4  1.299748 -1.733096

In [79]: df.tail()  (4)
Out[79]:           x         y
         5 -0.983310  0.357508
         6 -1.613579  1.470714
         7 -1.188018 -0.549746
         8 -0.940046 -0.827932
         9  0.108863  0.507810

(1) ndarray object with standard normally distributed random numbers.

(2) DataFrame object with the same random numbers.

(3) The first five rows via the head() method.

(4) The final five rows via the tail() method.


The following code illustrates the application of Python’s comparison operators and 
logical operators on values in the two columns: 


In [80]: df['x'] > 0.5  (1)
Out[80]: 0     True
         1    False
         2    False
         3    False
         4     True
         5    False
         6    False
         7    False
         8    False
         9    False
         Name: x, dtype: bool

In [81]: (df['x'] > 0) & (df['y'] < 0)  (2)
Out[81]: 0     True
         1    False
         2    False
         3     True
         4     True
         5    False
         6    False
         7    False
         8    False
         9    False
         dtype: bool

In [82]: (df['x'] > 0) | (df['y'] < 0)  (3)
Out[82]: 0     True
         1     True
         2     True
         3     True
         4     True
         5    False
         6    False
         7     True
         8     True
         9     True
         dtype: bool

(1) Check whether value in column x is greater than 0.5.

(2) Check whether value in column x is positive and value in column y is negative.

(3) Check whether value in column x is positive or value in column y is negative.


Using the resulting Boolean Series objects, complex data (row) selection is
straightforward. Alternatively, one can use the query() method and pass the
conditions as str objects:


In [83]: df[df['x'] > 0]  (1)
Out[83]:           x         y
         0  1.189622 -1.690617
         3  0.007315 -0.612939
         4  1.299748 -1.733096
         9  0.108863  0.507810

In [84]: df.query('x > 0')  (1)
Out[84]:           x         y
         0  1.189622 -1.690617
         3  0.007315 -0.612939
         4  1.299748 -1.733096
         9  0.108863  0.507810

In [85]: df[(df['x'] > 0) & (df['y'] < 0)]  (2)
Out[85]:           x         y
         0  1.189622 -1.690617
         3  0.007315 -0.612939
         4  1.299748 -1.733096

In [86]: df.query('x > 0 & y < 0')  (2)
Out[86]:           x         y
         0  1.189622 -1.690617
         3  0.007315 -0.612939
         4  1.299748 -1.733096

In [87]: df[(df.x > 0) | (df.y < 0)]  (3)
Out[87]:           x         y
         0  1.189622 -1.690617
         1 -1.356399 -1.232435
         2 -0.544439 -0.668172
         3  0.007315 -0.612939
         4  1.299748 -1.733096
         7 -1.188018 -0.549746
         8 -0.940046 -0.827932
         9  0.108863  0.507810

(1) All rows for which the value in column x is greater than 0.

(2) All rows for which the value in column x is positive and the value in column y is
    negative.

(3) All rows for which the value in column x is positive or the value in column y is
    negative (columns are accessed here via the respective attributes).
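The equivalence of the two selection styles can be checked directly. The following sketch uses a small made-up data set (not the random numbers from above); the names demo and threshold are hypothetical. It also shows that a query string may refer to a local Python variable by prefixing its name with @.

```python
import pandas as pd

# Small illustrative data set with made-up values.
demo = pd.DataFrame({'x': [1.2, -1.4, -0.5, 0.1],
                     'y': [-1.7, -1.2, 0.4, 0.5]})

threshold = 0.0  # hypothetical local variable referenced in the query string

# Selection via a Boolean Series object.
by_mask = demo[(demo['x'] > threshold) & (demo['y'] < 0)]

# Same selection via query(); @threshold refers to the local variable.
by_query = demo.query('x > @threshold and y < 0')

print(by_mask.equals(by_query))
```

Which style to prefer is largely a matter of readability; the Boolean mask is explicit, while the query string is more compact for longer conditions.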


Comparison operators can also be applied to complete DataFrame objects at once: 


In [88]: df > 0  (1)
Out[88]:        x      y
         0   True  False
         1  False  False
         2  False  False
         3   True  False
         4   True  False
         5  False   True
         6  False   True
         7  False  False
         8  False  False
         9   True   True

In [89]: df[df > 0]  (2)
Out[89]:           x         y
         0  1.189622       NaN
         1       NaN       NaN
         2       NaN       NaN
         3  0.007315       NaN
         4  1.299748       NaN
         5       NaN  0.357508
         6       NaN  1.470714
         7       NaN       NaN
         8       NaN       NaN
         9  0.108863  0.507810

(1) Which values in the DataFrame object are positive?

(2) Select all such values and put a NaN in all other places.


Concatenation, Joining, and Merging 


This section walks through different approaches to combine two simple data sets in 
the form of DataFrame objects. The two simple data sets are: 


In [90]: df1 = pd.DataFrame(['100', '200', '300', '400'],
                            index=['a', 'b', 'c', 'd'],
                            columns=['A',])

In [91]: df1
Out[91]:      A
         a  100
         b  200
         c  300
         d  400

In [92]: df2 = pd.DataFrame(['200', '150', '50'],
                            index=['f', 'b', 'd'],
                            columns=['B',])

In [93]: df2
Out[93]:      B
         f  200
         b  150
         d   50


Concatenation 


Concatenation or appending basically means that rows are added from one DataFrame 
object to another one. This can be accomplished via the append() method or via the 
pd.concat() function. A major consideration is how the index values are handled: 


In [94]: df1.append(df2, sort=False)  (1)
Out[94]:      A    B
         a  100  NaN
         b  200  NaN
         c  300  NaN
         d  400  NaN
         f  NaN  200
         b  NaN  150
         d  NaN   50

In [95]: df1.append(df2, ignore_index=True, sort=False)  (2)
Out[95]:      A    B
         0  100  NaN
         1  200  NaN
         2  300  NaN
         3  400  NaN
         4  NaN  200
         5  NaN  150
         6  NaN   50

In [96]: pd.concat((df1, df2), sort=False)  (3)
Out[96]:      A    B
         a  100  NaN
         b  200  NaN
         c  300  NaN
         d  400  NaN
         f  NaN  200
         b  NaN  150
         d  NaN   50

In [97]: pd.concat((df1, df2), ignore_index=True, sort=False)  (4)
Out[97]:      A    B
         0  100  NaN
         1  200  NaN
         2  300  NaN
         3  400  NaN
         4  NaN  200
         5  NaN  150
         6  NaN   50

(1) Appends data from df2 to df1 as new rows.

(2) Does the same but ignores the indices.

(3) Has the same effect as the first append operation.

(4) Has the same effect as the second append operation.

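Note that DataFrame.append() was removed in pandas 2.0, so with recent versions only the pd.concat() route remains available. The following sketch rebuilds the two example data sets and shows the row-wise variants plus, as an additional illustration not covered in the text, column-wise concatenation via axis=1. The names stacked, renumbered, and side_by_side are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame(['100', '200', '300', '400'],
                   index=['a', 'b', 'c', 'd'], columns=['A'])
df2 = pd.DataFrame(['200', '150', '50'],
                   index=['f', 'b', 'd'], columns=['B'])

# Row-wise concatenation keeps the original index labels, duplicates included.
stacked = pd.concat([df1, df2], sort=False)

# ignore_index=True replaces the labels with a fresh integer range.
renumbered = pd.concat([df1, df2], ignore_index=True, sort=False)

# axis=1 aligns on the index instead and places the columns side by side.
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked)
print(renumbered)
print(side_by_side)
```

The duplicated labels b and d in the first result are worth keeping in mind, since label-based selection on such an index returns multiple rows.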
Joining 


When joining the two data sets, the sequence of the DataFrame objects also matters 
but in a different way. Only the index values from the first DataFrame object are used. 
This default behavior is called a left join: 


In [98]: df1.join(df2)  (1)
Out[98]:      A    B
         a  100  NaN
         b  200  150
         c  300  NaN
         d  400   50

In [99]: df2.join(df1)  (2)
Out[99]:      B    A
         f  200  NaN
         b  150  200
         d   50  400

(1) Index values of df1 are relevant.

(2) Index values of df2 are relevant.


There are a total of four different join methods available, each leading to a different 
behavior with regard to how index values and the corresponding data rows are 
handled: 


In [100]: df1.join(df2, how='left')  (1)
Out[100]:      A    B
          a  100  NaN
          b  200  150
          c  300  NaN
          d  400   50

In [101]: df1.join(df2, how='right')  (2)
Out[101]:      A    B
          f  NaN  200
          b  200  150
          d  400   50

In [102]: df1.join(df2, how='inner')  (3)
Out[102]:      A    B
          b  200  150
          d  400   50

In [103]: df1.join(df2, how='outer')  (4)
Out[103]:      A    B
          a  100  NaN
          b  200  150
          c  300  NaN
          d  400   50
          f  NaN  200

(1) Left join is the default operation.

(2) Right join is the same as reversing the sequence of the DataFrame objects.

(3) Inner join only preserves those index values found in both indices.

(4) Outer join preserves all index values from both indices.
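The effect of the four join methods on the number of rows can be summarized in a few lines. This is a minimal sketch that rebuilds df1 and df2 as defined above; the names rows, right, and reversed_left are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame(['100', '200', '300', '400'],
                   index=['a', 'b', 'c', 'd'], columns=['A'])
df2 = pd.DataFrame(['200', '150', '50'],
                   index=['f', 'b', 'd'], columns=['B'])

# Number of rows preserved by each join method.
rows = {how: len(df1.join(df2, how=how))
        for how in ['left', 'right', 'inner', 'outer']}
print(rows)

# A right join uses the index of df2, just as df2.join(df1) does;
# only the column order differs, so the columns are reordered here.
right = df1.join(df2, how='right')
reversed_left = df2.join(df1)[['A', 'B']]
print(right.equals(reversed_left))
```

The row counts follow directly from the indices: four labels in df1, three in df2, two shared labels, and five distinct labels overall.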


A join can also happen based on an empty DataFrame object. In this case, the col- 
umns are created sequentially, leading to behavior similar to a left join: 


In [104]: df = pd.DataFrame()

In [105]: df['A'] = df1['A']  (1)

In [106]: df
Out[106]:      A
          a  100
          b  200
          c  300
          d  400

In [107]: df['B'] = df2  (2)

In [108]: df
Out[108]:      A    B
          a  100  NaN
          b  200  150
          c  300  NaN
          d  400   50

(1) df1 as first column A.

(2) df2 as second column B.

Making use of a dictionary to combine the data sets yields a result similar to an outer 
join since the columns are created simultaneously: 


In [109]: df = pd.DataFrame({'A': df1['A'], 'B': df2['B']})  (1)

In [110]: df
Out[110]:      A    B
          a  100  NaN
          b  200  150
          c  300  NaN
          d  400   50
          f  NaN  200

(1) The columns of the DataFrame objects are used as values in the dict object.


Merging 


While a join operation takes place based on the indices of the DataFrame objects to be 
joined, a merge operation typically takes place on a column shared between the two 
data sets. To this end, a new column C is added to both original DataFrame objects: 

In [111]: c = pd.Series([250, 150, 50], index=['b', 'd', 'c'])
          df1['C'] = c
          df2['C'] = c

In [112]: df1
Out[112]:      A      C
          a  100    NaN
          b  200  250.0
          c  300   50.0
          d  400  150.0

In [113]: df2
Out[113]:      B      C
          f  200    NaN
          b  150  250.0
          d   50  150.0

By default, the merge operation in this case takes place based on the single shared 
column C. Other options are available, however, such as an outer merge: 


In [114]: pd.merge(df1, df2)  (1)
Out[114]:      A      C    B
          0  100    NaN  200
          1  200  250.0  150
          2  400  150.0   50

In [115]: pd.merge(df1, df2, on='C')  (1)
Out[115]:      A      C    B
          0  100    NaN  200
          1  200  250.0  150
          2  400  150.0   50

In [116]: pd.merge(df1, df2, how='outer')  (2)
Out[116]:      A      C    B
          0  100    NaN  200
          1  200  250.0  150
          2  300   50.0  NaN
          3  400  150.0   50

(1) The default merge on column C.

(2) An outer merge is also possible, preserving all data rows.


Many more types of merge operations are available, a few of which are illustrated in 
the following code: 


In [117]: pd.merge(df1, df2, left_on='A', right_on='B')
Out[117]:      A    C_x    B  C_y
          0  200  250.0  200  NaN

In [118]: pd.merge(df1, df2, left_on='A', right_on='B', how='outer')
Out[118]:      A    C_x    B    C_y
          0  100    NaN  NaN    NaN
          1  200  250.0  200    NaN
          2  300   50.0  NaN    NaN
          3  400  150.0  NaN    NaN
          4  NaN    NaN  150  250.0
          5  NaN    NaN   50  150.0

In [119]: pd.merge(df1, df2, left_index=True, right_index=True)
Out[119]:      A    C_x    B    C_y
          b  200  250.0  150  250.0
          d  400  150.0   50  150.0

In [120]: pd.merge(df1, df2, on='C', left_index=True)
Out[120]:    A      C    B
           100    NaN  200
           200  250.0  150
           400  150.0   50

In [121]: pd.merge(df1, df2, on='C', right_index=True)
Out[121]:    A      C    B
           100    NaN  200
           200  250.0  150
           400  150.0   50

In [122]: pd.merge(df1, df2, on='C', left_index=True, right_index=True)
Out[122]:    A      C    B
           200  250.0  150
           400  150.0   50
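When experimenting with different merge types, it can help to see where each result row comes from. The following sketch uses the indicator parameter of pd.merge() for this purpose; the data sets left and right and their column names key, A, and B are made up for illustration and are not the objects from above.

```python
import pandas as pd

# Two small illustrative data sets sharing the column 'key'.
left = pd.DataFrame({'key': [1, 2, 3], 'A': ['a1', 'a2', 'a3']})
right = pd.DataFrame({'key': [2, 3, 4], 'B': ['b2', 'b3', 'b4']})

# indicator=True adds a _merge column with the values
# left_only, right_only, or both for every result row.
merged = pd.merge(left, right, on='key', how='outer', indicator=True)

print(merged)
```

Rows flagged left_only or right_only are exactly those that an inner merge would drop.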
Performance Aspects


Many examples in this chapter illustrate that there are often multiple options to
achieve the same goal with pandas. This section compares such options for adding up
two columns element-wise. First, the data set, generated with NumPy:


In [123]: data = np.random.standard_normal((1000000, 2))  (1)

In [124]: data.nbytes  (1)
Out[124]: 16000000

In [125]: df = pd.DataFrame(data, columns=['x', 'y'])  (2)

In [126]: df.info()  (2)
          <class 'pandas.core.frame.DataFrame'>
          RangeIndex: 1000000 entries, 0 to 999999
          Data columns (total 2 columns):
          x    1000000 non-null float64
          y    1000000 non-null float64
          dtypes: float64(2)
          memory usage: 15.3 MB

(1) The ndarray object with random numbers.

(2) The DataFrame object with the random numbers.


Second, some options to accomplish the task at hand with performance values:


In [127]: %time res = df['x'] + df['y']  (1)
          CPU times: user 7.35 ms, sys: 7.43 ms, total: 14.8 ms
          Wall time: 7.48 ms

In [128]: res[:3]
Out[128]: 0    0.387242
          1   -0.969343
          2   -0.863159
          dtype: float64

In [129]: %time res = df.sum(axis=1)  (2)
          CPU times: user 130 ms, sys: 30.6 ms, total: 161 ms
          Wall time: 101 ms

In [130]: res[:3]
Out[130]: 0    0.387242
          1   -0.969343
          2   -0.863159
          dtype: float64

In [131]: %time res = df.values.sum(axis=1)  (3)
          CPU times: user 50.3 ms, sys: 2.75 ms, total: 53.1 ms
          Wall time: 27.9 ms

In [132]: res[:3]
Out[132]: array([ 0.3872424 , -0.96934273, -0.86315944])

In [133]: %time res = np.sum(df, axis=1)  (4)
          CPU times: user 127 ms, sys: 15.1 ms, total: 142 ms
          Wall time: 73.7 ms

In [134]: res[:3]
Out[134]: 0    0.387242
          1   -0.969343
          2   -0.863159
          dtype: float64

In [135]: %time res = np.sum(df.values, axis=1)  (5)
          CPU times: user 49.3 ms, sys: 2.36 ms, total: 51.7 ms
          Wall time: 26.9 ms

In [136]: res[:3]
Out[136]: array([ 0.3872424 , -0.96934273, -0.86315944])

(1) Working with the columns (Series objects) directly is the fastest approach.

(2) This calculates the sums by calling the sum() method on the DataFrame object.

(3) This calculates the sums by calling the sum() method on the ndarray object.

(4) This calculates the sums by using the function np.sum() on the DataFrame
    object.

(5) This calculates the sums by using the function np.sum() on the ndarray object.


Finally, two more options which are based on the methods eval() and apply(),
respectively:¹


In [137]: %time res = df.eval('x + y')  (1)
          CPU times: user 25.5 ms, sys: 17.7 ms, total: 43.2 ms
          Wall time: 22.5 ms

In [138]: res[:3]
Out[138]: 0    0.387242
          1   -0.969343
          2   -0.863159
          dtype: float64

In [139]: %time res = df.apply(lambda row: row['x'] + row['y'], axis=1)  (2)
          CPU times: user 19.6 s, sys: 83.3 ms, total: 19.7 s
          Wall time: 19.9 s

In [140]: res[:3]
Out[140]: 0    0.387242
          1   -0.969343
          2   -0.863159
          dtype: float64

(1) eval() is a method dedicated to evaluation of (complex) numerical expressions;
    columns can be directly addressed.

(2) The slowest option is to use the apply() method row-by-row; this is like looping
    on the Python level over all rows.


¹ The application of the eval() method requires the numexpr package to be installed.


Choose Wisely 


pandas often provides multiple options to accomplish the same 
goal. If unsure of which to use, compare the options to verify that 
the best possible performance is achieved when time is critical. In 
this simple example, execution times differ by orders of magnitude. 
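Outside of an IPython session, where the %time magic is not available, such a comparison can be sketched with the standard library timeit module. The following is a minimal example under stated assumptions: the data set is smaller than the one above to keep the run short, the dictionary candidates and its labels are hypothetical, and the absolute timings depend on the machine and the installed package versions.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100_000, 2)), columns=['x', 'y'])

# Candidate implementations of the same element-wise sum.
candidates = {
    'columns': lambda: df['x'] + df['y'],
    'df.sum': lambda: df.sum(axis=1),
    'ndarray.sum': lambda: df.values.sum(axis=1),
}

# Verify that all candidates agree before comparing their speed.
reference = candidates['columns']()
for name, func in candidates.items():
    assert np.allclose(reference, func()), name

# Best of three single runs per candidate.
timings = {name: min(timeit.repeat(func, number=1, repeat=3))
           for name, func in candidates.items()}

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f'{name:12s} {seconds * 1e3:8.2f} ms')
```

Checking that the results agree first guards against comparing the speed of options that do not actually compute the same thing.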


Conclusion 


pandas is a powerful tool for data analysis and has become the central package in the 
so-called PyData stack. Its DataFrame class is particularly suited to working with tab- 
ular data of any kind. Most operations on such objects are vectorized, leading not 
only—as in the NumPy case—to concise code but also to high performance in general. 
In addition, pandas makes working with incomplete data sets convenient (which is 
not the case with NumPy, for instance). pandas and the DataFrame class will be central 
in many later chapters of the book, where additional features will be used and intro- 
duced when necessary. 


Further Reading 


pandas is an open source project with both online documentation and a PDF version
available for download.² The website provides links to both, and additional resources:


• http://pandas.pydata.org/


As for NumPy, recommended references for pandas in book form are:


• McKinney, Wes (2017). Python for Data Analysis. Sebastopol, CA: O'Reilly.


• VanderPlas, Jake (2016). Python Data Science Handbook. Sebastopol, CA:
  O'Reilly.


² At the time of this writing, the PDF version has a total of more than 2,500 pages.


CHAPTER 6
Object-Oriented Programming 


The purpose of software engineering is to control complexity, not to create it. 


—Pamela Zave 


Object-oriented programming (OOP) is one of the most popular programming para- 
digms today. Used in the right way, it provides a number of advantages compared to, 
for example, procedural programming. In many cases, OOP seems to be particularly 
suited for financial modeling and implementing financial algorithms. However, there 
are also many critics, voicing their skepticism about single aspects of OOP or even 
the paradigm as a whole. This chapter takes a neutral stance, in that OOP is consid- 
ered an important tool that might not be the best one for every single problem, but 
that should be at the disposal of programmers and quants working in finance. 


With OOP, some new language comes along. The most important terms for the pur- 
poses of this book and chapter are (more follow later): 


Class 
An abstract definition of a certain type of object. For example, a human being.


Object 
An instance of a class. For example, Sandra. 


Attribute 
A feature of the class (class attribute) or of an instance of the class (instance 
attribute). For example, being a mammal, being male or female, or color of the 
eyes. 


Method 
An operation that the class or an instance of the class can implement. For example,
walking.


Parameters
Input taken by a method to influence its behavior. For example, three steps. 


Instantiation 
The process of creating a specific object based on an abstract class. 


Translated into Python code, a simple class implementing the example of a human 
being might look as follows: 

In [1]: class HumanBeing(object):  (1)
            def __init__(self, first_name, eye_color):  (2)
                self.first_name = first_name  (3)
                self.eye_color = eye_color  (4)
                self.position = 0  (5)
            def walk_steps(self, steps):  (6)
                self.position += steps  (7)

(1) Class definition statement; self refers to the current instance of the class.

(2) Special method called during instantiation.

(3) First name attribute initialized with parameter value.

(4) Eye color attribute initialized with parameter value.

(5) Position attribute initialized with 0.

(6) Method definition for walking with steps as parameter.

(7) Code that changes the position given the steps value.


Based on the class definition, a new Python object can be instantiated and used:


In [2]: Sandra = HumanBeing('Sandra', 'blue')  (1)

In [3]: Sandra.first_name  (2)
Out[3]: 'Sandra'

In [4]: Sandra.position  (2)
Out[4]: 0

In [5]: Sandra.walk_steps(5)  (3)

In [6]: Sandra.position  (4)
Out[6]: 5

(1) The instantiation.

(2) Accessing attribute values.

(3) Calling the method.

(4) Accessing the updated position value.


There are several human aspects that might speak for the use of OOP:


Natural way of thinking 
Human thinking typically revolves around real-world or abstract objects, like a
car or a financial instrument. OOP is suited to modeling such objects with their
characteristics.


Reducing complexity 
Via different approaches, OOP helps to reduce the complexity of a problem or 
algorithm and to model it feature-by-feature. 


Nicer user interfaces 
OOP allows in many cases for nicer user interfaces and more compact code. This
becomes evident, for example, when looking at the NumPy ndarray class or
pandas DataFrame class.


Pythonic way of modeling 
Independent of the pros and cons of OOP, it is simply the dominant paradigm in 
Python. This is where the saying “everything is an object in Python” comes from. 
OOP also allows the programmer to build custom classes whose instances behave 
like every other instance of a standard Python class. 


There are also several technical aspects that might speak for OOP: 


Abstraction 
The use of attributes and methods allows building abstract, flexible models of 
objects, with a focus on what is relevant and neglecting what is not needed. In 
finance, this might mean having a general class that models a financial instru- 
ment in abstract fashion. Instances of such a class would then be concrete finan- 
cial products, engineered and offered by an investment bank, for example. 


Modularity 
OOP simplifies breaking code down into multiple modules which are then linked 
to form the complete codebase. For example, modeling a European option on a 
stock could be achieved by a single class or by two classes, one for the underlying 
stock and one for the option itself. 


Inheritance 
Inheritance refers to the concept that one class can inherit attributes and meth- 
ods from another class. In finance, starting with a general financial instrument, 
the next level could be a general derivative instrument, then a European option,
then a European call option. Every class might inherit attributes and methods
from class(es) on a higher level. 


Aggregation 
Aggregation refers to the case in which an object is at least partly made up of 
multiple other objects that might exist independently. A class modeling a Euro- 
pean call option might have as attributes other objects for the underlying stock 
and the relevant short rate for discounting. The objects representing the stock 
and the short rate can be used independently by other objects as well. 


Composition 
Composition is similar to aggregation, but here the single objects cannot exist 
independently of each other. Consider a custom-tailored interest rate swap with 
a fixed leg and a floating leg. The two legs do not exist independently of the swap 
itself. 


Polymorphism 
Polymorphism can take on multiple forms. Of particular importance in a Python 
context is what is called duck typing. This refers to the fact that standard opera- 
tions can be implemented on many different classes and their instances without 
knowing exactly what object one is dealing with. For a class of financial instru- 
ments this might mean that one can call a method get_current_price() inde- 
pendent of the specific type of the object (stock, option, swap). 


Encapsulation 

This concept refers to the approach of making data within a class accessible only
via public methods. A class modeling a stock might have an attribute
current_stock_price. Encapsulation would then give access to the attribute value
via a method get_current_stock_price() and would hide the data from the
user (i.e., make it private). This approach might avoid unintended effects that
arise from simply working with and possibly changing attribute values. However,
there are limits as to how data can be made private in a Python class.
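Several of the technical aspects listed above can be sketched in a few lines of plain Python. The class names FinancialInstrument, Stock, and EuropeanCall, the symbol XYZ, and all numbers are hypothetical and chosen for illustration only; the option "price" is simply the intrinsic value and serves as a stand-in for a real pricing model.

```python
class FinancialInstrument:
    """Hypothetical base class: a named instrument with a price."""

    def __init__(self, symbol, price):
        self.symbol = symbol
        self.__price = price  # name-mangled, private by convention only

    def get_current_price(self):
        # Encapsulation: the price is read via a public method.
        return self.__price


class Stock(FinancialInstrument):
    """Inheritance: reuses attributes and methods of the base class."""


class EuropeanCall:
    """Aggregation: holds a Stock object that also exists independently."""

    def __init__(self, underlying, strike):
        self.underlying = underlying
        self.strike = strike

    def get_current_price(self):
        # Intrinsic value only, as a stand-in for a real pricing model.
        return max(self.underlying.get_current_price() - self.strike, 0.0)


stock = Stock('XYZ', 105.0)
call = EuropeanCall(stock, strike=100.0)

# Duck typing: both objects offer get_current_price(), whatever their class.
prices = [obj.get_current_price() for obj in (stock, call)]
print(prices)  # [105.0, 5.0]

# Limits of privacy: the name-mangled attribute remains reachable.
print(stock._FinancialInstrument__price)  # 105.0
```

The last line illustrates why privacy in Python is a convention rather than an enforced guarantee: the double leading underscore only renames the attribute, it does not hide it.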


On a somewhat higher level, many of these aspects can be summarized by two general
goals in software engineering:


Reusability 
Concepts like inheritance and polymorphism improve code reusability and 
increase the efficiency and productivity of the programmer. They also simplify 
code maintenance. 


Nonredundancy 
At the same time, these approaches allow one to build almost nonredundant
code, avoiding double implementation effort and reducing debugging and testing
effort as well as maintenance effort. They might also lead to a smaller overall
codebase.


This chapter is organized as follows: 


“A Look at Python Objects”
This section takes a look at some Python objects through the lens of OOP.


“Basics of Python Classes”
This section introduces central elements of OOP in Python and uses financial
instruments and portfolio positions as major examples.


“Python Data Model”
This section discusses important elements of the Python data model and roles
that certain special methods play.


A Look at Python Objects 


Let’s start by taking a brief look at some standard objects encountered in previous 
chapters through the eyes of an OOP programmer. 


int 
To start simple, consider an integer object. Even with such a simple Python object, 
the major OOP features are present: 


In [7]: n = 5  (1)

In [8]: type(n)  (2)
Out[8]: int

In [9]: n.numerator  (3)
Out[9]: 5

In [10]: n.bit_length()  (4)
Out[10]: 3

In [11]: n + n  (5)
Out[11]: 10

In [12]: 2 * n  (6)
Out[12]: 10

In [13]: n.__sizeof__()  (7)
Out[13]: 28

(1) New instance n.

(2) Type of the object.

(3) An attribute.

(4) A method.

(5) Applying the + operator (addition).

(6) Applying the * operator (multiplication).

(7) Calling the special method __sizeof__() to get the memory usage in bytes.¹
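That operators are an OOP feature as well becomes visible when the special methods behind them are called explicitly. The following minimal sketch uses only built-in functionality.

```python
n = 5

# The + operator is dispatched to the special method __add__() of int.
print(n + n == n.__add__(n))

# The * operator is dispatched to __mul__() of the left operand.
print(2 * n == (2).__mul__(n))

# Attributes and methods are defined on the class and can be inspected.
print('bit_length' in dir(int))

# Every int object is also an instance of the general object class.
print(isinstance(n, object))
```

All four statements print True, which shows that the operator syntax is a convenient front end for ordinary method calls.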


list 
list objects have some more methods but basically behave the same way: 

In [14]: l = [1, 2, 3, 4]  (1)

In [15]: type(l)  (2)
Out[15]: list

In [16]: l[0]  (3)
Out[16]: 1

In [17]: l.append(10)  (4)

In [18]: l + l  (5)
Out[18]: [1, 2, 3, 4, 10, 1, 2, 3, 4, 10]

In [19]: 2 * l  (6)
Out[19]: [1, 2, 3, 4, 10, 1, 2, 3, 4, 10]

In [20]: sum(l)  (7)
Out[20]: 20

In [21]: l.__sizeof__()  (8)
Out[21]: 104

(1) New instance l.

(2) Type of the object.

(3) Selecting an element via indexing.

(4) A method.

(5) Applying the + operator (concatenation).

(6) Applying the * operator (concatenation).

(7) Applying the standard Python function sum().

(8) Calling the special method __sizeof__() to get the memory usage in bytes.


¹ Special attributes and methods in Python are characterized by double leading and trailing underscores as in
__XYZ__(). n.__sizeof__() returns the size of the Python object n in bytes.


ndarray 


int and list objects are standard Python objects. The NumPy ndarray object is a 
“custom-made” object from an open source package: 


In [22]: import numpy as np  (1)

In [23]: a = np.arange(16).reshape((4, 4))  (2)

In [24]: a  (2)
Out[24]: array([[ 0,  1,  2,  3],
                [ 4,  5,  6,  7],
                [ 8,  9, 10, 11],
                [12, 13, 14, 15]])

In [25]: type(a)  (3)
Out[25]: numpy.ndarray

(1) Importing numpy.

(2) A new instance a.

(3) Type of the object.


Although the ndarray object is not a standard object, it behaves in many cases as if it 
were one—thanks to the Python data model, as explained later in this chapter: 


In [26]: a.nbytes  (1)
Out[26]: 128

In [27]: a.sum()  (2)
Out[27]: 120

In [28]: a.cumsum(axis=0)  (3)
Out[28]: array([[ 0,  1,  2,  3],
                [ 4,  6,  8, 10],
                [12, 15, 18, 21],
                [24, 28, 32, 36]])

In [29]: a + a  (4)
Out[29]: array([[ 0,  2,  4,  6],
                [ 8, 10, 12, 14],
                [16, 18, 20, 22],
                [24, 26, 28, 30]])

In [30]: 2 * a  (5)
Out[30]: array([[ 0,  2,  4,  6],
                [ 8, 10, 12, 14],
                [16, 18, 20, 22],
                [24, 26, 28, 30]])

In [31]: sum(a)  (6)
Out[31]: array([24, 28, 32, 36])

In [32]: np.sum(a)  (7)
Out[32]: 120

In [33]: a.__sizeof__()  (8)
Out[33]: 112

(1) An attribute.

(2) A method (aggregation).

(3) A method (no aggregation).

(4) Applying the + operator (addition).

(5) Applying the * operator (multiplication).

(6) Applying the standard Python function sum().

(7) Applying the NumPy universal function np.sum().

(8) Calling the special method __sizeof__() to get the memory usage in bytes.
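The different results of sum(a) and np.sum(a) above follow from how the two functions traverse the object. The following sketch makes this explicit; the names col_totals, total, and same are hypothetical.

```python
import numpy as np

a = np.arange(16).reshape((4, 4))

# The built-in sum() iterates over the first axis and adds up the rows,
# so the result is an array with one total per column.
col_totals = sum(a)

# np.sum() without an axis argument reduces over all elements.
total = np.sum(a)

# With axis=0, np.sum() reproduces the result of the built-in sum().
same = np.array_equal(col_totals, np.sum(a, axis=0))

print(col_totals, total, same)
```

The built-in function only relies on iteration and the + operator, which the ndarray class supports through the Python data model, while np.sum() knows about the array's axes.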


DataFrame 


Finally, a quick look at the pandas DataFrame object, which behaves similarly to the
ndarray object. First, the instantiation of the DataFrame object based on the ndarray
object:


In [34]: import pandas as pd  (1)

In [35]: df = pd.DataFrame(a, columns=list('abcd'))  (2)

In [36]: type(df)  (3)
Out[36]: pandas.core.frame.DataFrame

(1) Importing pandas.

(2) A new instance df.

(3) Type of the object.

Second, a look at attributes, methods, and operations: 


In [37]: df.columns  (1)
Out[37]: Index(['a', 'b', 'c', 'd'], dtype='object')

In [38]: df.sum()  (2)
Out[38]: a    24
         b    28
         c    32
         d    36
         dtype: int64

In [39]: df.cumsum()  (3)
Out[39]:     a   b   c   d
         0   0   1   2   3
         1   4   6   8  10
         2  12  15  18  21
         3  24  28  32  36

In [40]: df + df  (4)
Out[40]:     a   b   c   d
         0   0   2   4   6
         1   8  10  12  14
         2  16  18  20  22
         3  24  26  28  30

In [41]: 2 * df  (5)
Out[41]:     a   b   c   d
         0   0   2   4   6
         1   8  10  12  14
         2  16  18  20  22
         3  24  26  28  30

In [42]: np.sum(df)  (6)
Out[42]: a    24
         b    28
         c    32
         d    36
         dtype: int64

In [43]: df.__sizeof__()  (7)
Out[43]: 208

(1) An attribute.
(2) A method (aggregation).
(3) A method (no aggregation).
(4) Applying the + operator (addition).
(5) Applying the * operator (multiplication).
(6) Applying the NumPy universal function np.sum().
(7) Calling the special method __sizeof__() to get the memory usage in bytes.


Basics of Python Classes 


This section covers major concepts and the concrete syntax to make use of OOP in 
Python. The context now is about building custom classes to model types of objects 
that cannot be easily, efficiently, or properly modeled by existing Python object types. 
Throughout, the example of a financial instrument is used. 


Two lines of code suffice to create a new Python class: 


In [44]: class FinancialInstrument(object):  (1)
             pass  (2)

In [45]: fi = FinancialInstrument()  (3)

In [46]: type(fi)  (4)
Out[46]: __main__.FinancialInstrument

In [47]: fi  (5)
Out[47]: <__main__.FinancialInstrument at 0x116767278>

In [48]: fi.__str__()  (5)
Out[48]: '<__main__.FinancialInstrument object at 0x116767278>'

In [49]: fi.price = 100  (6)

In [50]: fi.price  (6)
Out[50]: 100

(1) Class definition statement.²
(2) Some code; here simply the pass keyword.
(3) A new instance of the class named fi.
(4) The type of the object.
(5) Every Python object comes with certain “special” attributes and methods (from
    object); here, the special method to retrieve the string representation is called.
(6) So-called data attributes (in contrast to regular attributes) can be defined on the
    fly for every object.

² Camel-case naming for classes is recommended. However, if there is no ambiguity, lowercase or snake case
(as in financial_instrument) can also be used.


An important special method is __init__, which gets called during every instantia- 
tion of an object. It takes as parameters the object itself (self, by convention) and 
potentially multiple others: 


In [51]: class FinancialInstrument(object):
             author = 'Yves Hilpisch'  (1)
             def __init__(self, symbol, price):  (2)
                 self.symbol = symbol  (3)
                 self.price = price  (3)

In [52]: FinancialInstrument.author  (1)
Out[52]: 'Yves Hilpisch'

In [53]: aapl = FinancialInstrument('AAPL', 100)  (4)

In [54]: aapl.symbol  (5)
Out[54]: 'AAPL'

In [55]: aapl.author  (6)
Out[55]: 'Yves Hilpisch'

In [56]: aapl.price = 105  (7)

In [57]: aapl.price  (7)
Out[57]: 105

(1) Definition of a class attribute (inherited by every instance).
(2) The special method __init__ called during initialization.
(3) Definition of the instance attributes (individual to every instance).
(4) A new instance of the class named aapl.
(5) Accessing an instance attribute.
(6) Accessing a class attribute.
(7) Changing the value of an instance attribute.


Prices of financial instruments change regularly, but the symbol of a financial instru- 
ment probably does not change. To introduce encapsulation to the class definition, 
two methods, get_price() and set_price(), might be defined. The code that follows
additionally inherits from the previous class definition (and not from object anymore):

In [58]: class FinancialInstrument(FinancialInstrument):  (1)
             def get_price(self):  (2)
                 return self.price  (2)
             def set_price(self, price):  (3)
                 self.price = price  (4)

In [59]: fi = FinancialInstrument('AAPL', 100)  (5)

In [60]: fi.get_price()  (6)
Out[60]: 100

In [61]: fi.set_price(105)  (7)

In [62]: fi.get_price()  (6)
Out[62]: 105

In [63]: fi.price  (8)
Out[63]: 105

(1) Class definition via inheritance from previous version.
(2) Defines the get_price() method.
(3) Defines the set_price() method ...
(4) ... and updates the instance attribute value given the parameter value.
(5) A new instance based on the new class definition named fi.
(6) Calls the get_price() method to read the instance attribute value.
(7) Updates the instance attribute value via set_price().
(8) Direct access to the instance attribute.


Encapsulation generally has the goal of hiding data from the user working with a 
class. Adding getter and setter methods is one part of achieving this goal. However, 
this does not prevent the user from directly accessing and manipulating instance 
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attributes. This is where private instance attributes come into play. They are defined 
by two leading underscores: 
In [64]: class FinancialInstrument(object):
             def __init__(self, symbol, price):
                 self.symbol = symbol
                 self.__price = price  (1)
             def get_price(self):
                 return self.__price
             def set_price(self, price):
                 self.__price = price

In [65]: fi = FinancialInstrument('AAPL', 100)

In [66]: fi.get_price()  (2)
Out[66]: 100

In [67]: fi.__price  (3)

AttributeError                            Traceback (most recent call last)
<ipython-input-67-bd62f6cadb79> in <module>
----> 1 fi.__price

AttributeError: 'FinancialInstrument' object has no attribute '__price'

In [68]: fi._FinancialInstrument__price  (4)
Out[68]: 100

In [69]: fi._FinancialInstrument__price = 105  (4)

In [70]: fi.set_price(100)  (5)

(1) Price is defined as a private instance attribute.
(2) The method get_price() returns its value.
(3) Trying to access the attribute directly raises an error.
(4) If the attribute name is prefixed with the class name and a single leading
    underscore (here, _FinancialInstrument__price), direct access and manipulation
    are still possible.
(5) Sets the price back to its original value.
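Why the prefixed name still works becomes visible when inspecting the instance's attribute dictionary: Python stores a double-underscore attribute under a "mangled" name that includes the class name. The following is a minimal sketch of that check; the variable names attribute_names, has_plain, and has_mangled are illustrative only:

```python
class FinancialInstrument(object):
    def __init__(self, symbol, price):
        self.symbol = symbol
        self.__price = price  # stored as _FinancialInstrument__price

    def get_price(self):
        return self.__price


fi = FinancialInstrument('AAPL', 100)

# vars() returns the instance dictionary with the names as actually stored.
attribute_names = sorted(vars(fi))

# Outside the class body no mangling takes place, so the plain name is unknown.
has_plain = hasattr(fi, '__price')
has_mangled = hasattr(fi, '_FinancialInstrument__price')

print(attribute_names)
print(has_plain, has_mangled)
```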
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Encapsulation in Python 


Although encapsulation can basically be implemented for Python 
classes via private instance attributes and respective methods deal- 
ing with them, the hiding of data from the user cannot be fully 
enforced. In that sense, it is more an engineering principle in 
Python than a technical feature of Python classes. 
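As an alternative to explicit getter and setter methods, Python's built-in property decorator allows attribute-style access while still routing reads and writes through methods. The sketch below is not part of the original example; the non-negativity check in particular is an assumption added purely to illustrate what a setter might enforce:

```python
class FinancialInstrument(object):
    def __init__(self, symbol, price):
        self.symbol = symbol
        self.price = price  # goes through the property setter below

    @property
    def price(self):
        # Read access via fi.price calls this method.
        return self._price

    @price.setter
    def price(self, value):
        # Hypothetical validation rule, chosen for illustration only.
        if value < 0:
            raise ValueError('price must be non-negative')
        self._price = value


fi = FinancialInstrument('AAPL', 100)
fi.price = 105  # calls the setter
print(fi.price)  # calls the getter
```

As with private attributes, this does not fully hide the data: the underlying _price attribute remains accessible to a determined user.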


Consider another class that models a portfolio position of a financial instrument. 
With the two classes aggregation as a concept is easily illustrated. An instance of the 
PortfolioPosition class takes an instance of the FinancialInstrument class as an 
attribute value. Adding an instance attribute, such as position_size, one can then 
calculate, for instance, the position value: 


In [71]: class PortfolioPosition(object):
             def __init__(self, financial_instrument, position_size):
                 self.position = financial_instrument  (1)
                 self.__position_size = position_size  (2)
             def get_position_size(self):
                 return self.__position_size
             def update_position_size(self, position_size):
                 self.__position_size = position_size
             def get_position_value(self):
                 return self.__position_size * \
                        self.position.get_price()  (3)

In [72]: pp = PortfolioPosition(fi, 10)

In [73]: pp.get_position_size()
Out[73]: 10

In [74]: pp.get_position_value()  (3)
Out[74]: 1000

In [75]: pp.position.get_price()  (4)
Out[75]: 100

In [76]: pp.position.set_price(105)  (5)

In [77]: pp.get_position_value()  (6)
Out[77]: 1050

(1) An instance attribute based on an instance of the FinancialInstrument class.
(2) A private instance attribute of the PortfolioPosition class.
(3) Calculates the position value based on the attributes.
(4) Methods attached to the instance attribute object can be accessed directly (could
    be hidden as well).
(5) Updates the price of the financial instrument.
(6) Calculates the new position value based on the updated price.
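One consequence of aggregation is that a PortfolioPosition object holds a reference to the FinancialInstrument object, not a copy of it. The following sketch, with the two classes reduced to the methods needed here and with the hypothetical instance names pp_1 and pp_2, illustrates that a single price update is seen by every position based on the same instrument:

```python
class FinancialInstrument(object):
    def __init__(self, symbol, price):
        self.symbol = symbol
        self.__price = price

    def get_price(self):
        return self.__price

    def set_price(self, price):
        self.__price = price


class PortfolioPosition(object):
    def __init__(self, financial_instrument, position_size):
        self.position = financial_instrument  # a reference, not a copy
        self.__position_size = position_size

    def get_position_value(self):
        return self.__position_size * self.position.get_price()


fi = FinancialInstrument('AAPL', 100)
pp_1 = PortfolioPosition(fi, 10)
pp_2 = PortfolioPosition(fi, 5)

fi.set_price(105)  # one update on the shared instrument object ...

# ... is reflected in the value of both positions.
values = [pp_1.get_position_value(), pp_2.get_position_value()]
print(values)
```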


Python Data Model 


The examples in the previous section highlighted some aspects of the so-called 
Python data or object model. The Python data model allows you to design classes that 
consistently interact with basic language constructs of Python. Among others, it sup- 
ports (see Ramalho (2015), p. 4) the following tasks and constructs: 


• Iteration
• Collection handling
• Attribute access
• Operator overloading
• Function and method invocation
• Object creation and destruction
• String representation (e.g., for printing)
• Managed contexts (i.e., with blocks)


Since the Python data model is so important, this section is dedicated to an example 
(from Ramalho (2015), with slight adjustments) that explores several aspects of it. It 
implements a class for one-dimensional, three-element vectors (think of vectors in 
Euclidean space). First, the special method __init__: 


In [78]: class Vector(object):
             def __init__(self, x=0, y=0, z=0):  (1)
                 self.x = x  (1)
                 self.y = y  (1)
                 self.z = z  (1)

In [79]: v = Vector(1, 2, 3)  (2)

In [80]: v  (3)
Out[80]: <__main__.Vector at 0x1167789e8>

(1) Three preinitialized instance attributes (think three-dimensional space).
(2) A new instance of the class named v.
(3) The default string representation.


The special method __repr__ allows the definition of custom string representations:

In [81]: class Vector(Vector):
             def __repr__(self):
                 return 'Vector(%r, %r, %r)' % (self.x, self.y, self.z)

In [82]: v = Vector(1, 2, 3)

In [83]: v  (1)
Out[83]: Vector(1, 2, 3)

In [84]: print(v)  (1)
Vector(1, 2, 3)

(1) The new string representation.


abs() and bool() are two standard Python functions whose behavior on the Vector 
class can be defined via the special methods __abs__ and __bool__: 


In [85]: class Vector(Vector):
             def __abs__(self):
                 return (self.x ** 2 + self.y ** 2 +
                         self.z ** 2) ** 0.5  (1)

             def __bool__(self):
                 return bool(abs(self))

In [86]: v = Vector(1, 2, -1)  (2)

In [87]: abs(v)
Out[87]: 2.449489742783178

In [88]: bool(v)
Out[88]: True

In [89]: v = Vector()  (3)

In [90]: v  (3)
Out[90]: Vector(0, 0, 0)

In [91]: abs(v)
Out[91]: 0.0

In [92]: bool(v)
Out[92]: False

(1) Returns the Euclidean norm given the three attribute values.
(2) A new Vector object with nonzero attribute values.
(3) A new Vector object with zero attribute values only.


As shown multiple times, the + and * operators can be applied to almost any Python
object. The behavior is defined through the special methods __add__ and __mul__:

In [93]: class Vector(Vector):
             def __add__(self, other):
                 x = self.x + other.x
                 y = self.y + other.y
                 z = self.z + other.z
                 return Vector(x, y, z)  (1)

             def __mul__(self, scalar):
                 return Vector(self.x * scalar,
                               self.y * scalar,
                               self.z * scalar)  (1)

In [94]: v = Vector(1, 2, 3)

In [95]: v + Vector(2, 3, 4)
Out[95]: Vector(3, 5, 7)

In [96]: v * 2
Out[96]: Vector(2, 4, 6)

(1) In this case, each special method returns an object of its own kind.
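Note that __mul__ only covers the case in which the Vector object is the left operand. For an expression such as 2 * v, Python first asks the int object, which does not know how to handle a Vector, and then falls back to the special method __rmul__ of the right operand. The following sketch adds this method to a reduced version of the class; it is an extension for illustration and not part of the original example:

```python
class Vector(object):
    def __init__(self, x=0, y=0, z=0):
        self.x = x
        self.y = y
        self.z = z

    def __repr__(self):
        return 'Vector(%r, %r, %r)' % (self.x, self.y, self.z)

    def __mul__(self, scalar):
        # Handles vector * scalar.
        return Vector(self.x * scalar,
                      self.y * scalar,
                      self.z * scalar)

    def __rmul__(self, scalar):
        # Handles scalar * vector by delegating to __mul__.
        return self * scalar


v = Vector(1, 2, 3)
left = v * 2   # calls Vector.__mul__
right = 2 * v  # calls Vector.__rmul__
print(left, right)
```

Without __rmul__, the expression 2 * v raises a TypeError.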


Another standard Python function is len(), which gives the length of an object in 
number of elements. This function accesses the special method __len__ when called 
on an object. On the other hand, the special method __getitem__ makes indexing via 
the square bracket notation possible: 


In [97]: class Vector(Vector):
             def __len__(self):
                 return 3  (1)

             def __getitem__(self, i):
                 if i in [0, -3]: return self.x
                 elif i in [1, -2]: return self.y
                 elif i in [2, -1]: return self.z
                 else: raise IndexError('Index out of range.')

In [98]: v = Vector(1, 2, 3)

In [99]: len(v)
Out[99]: 3

In [100]: v[0]
Out[100]: 1

In [101]: v[-2]
Out[101]: 2

In [102]: v[3]

IndexError                                Traceback (most recent call last)
<ipython-input-102-f998c57dcc1le> in <module>
----> 1 v[3]

<ipython-input-97-bOca25eef7b3> in __getitem__(self, i)
      7         elif i in [1, -2]: return self.y
      8         elif i in [2, -1]: return self.z
----> 9         else: raise IndexError('Index out of range.')

IndexError: Index out of range.

(1) All instances of the Vector class have a length of three.


Finally, the special method __iter__ defines the behavior during iterations over ele- 
ments of an object. An object for which this operation is defined is called iterable. For 
instance, all collections and containers are iterable: 


In [103]: class Vector(Vector):
              def __iter__(self):
                  for i in range(len(self)):
                      yield self[i]

In [104]: v = Vector(1, 2, 3)

In [105]: for i in range(3):  (1)
              print(v[i])  (1)
          1
          2
          3

In [106]: for coordinate in v:  (2)
              print(coordinate)  (2)
          1
          2
          3

(1) Indirect iteration using index values (via __getitem__).
(2) Direct iteration over the class instance (using __iter__).
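Once an object is iterable, it works not only in for loops but with every standard Python construct that consumes an iterable. The following sketch uses a reduced version of the Vector class to illustrate this; the variable names as_list, total, and largest are illustrative only:

```python
class Vector(object):
    def __init__(self, x=0, y=0, z=0):
        self.x = x
        self.y = y
        self.z = z

    def __len__(self):
        return 3

    def __getitem__(self, i):
        if i in [0, -3]: return self.x
        elif i in [1, -2]: return self.y
        elif i in [2, -1]: return self.z
        else: raise IndexError('Index out of range.')

    def __iter__(self):
        for i in range(len(self)):
            yield self[i]


v = Vector(1, 2, 3)

x, y, z = v        # tuple unpacking
as_list = list(v)  # conversion to a list object
total = sum(v)     # aggregation with a built-in function
largest = max(v)

print(as_list, total, largest)
```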
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Enhancing Python 


The Python data model allows the definition of Python classes that 
interact with standard Python operators, functions, etc., seamlessly. 
This makes Python a rather flexible programming language that 
can easily be enhanced by new classes and types of objects. 


As a summary, the following section provides the Vector class definition in a single 
code block. 


The Vector Class 


In [107]: class Vector(object):
              def __init__(self, x=0, y=0, z=0):
                  self.x = x
                  self.y = y
                  self.z = z

              def __repr__(self):
                  return 'Vector(%r, %r, %r)' % (self.x, self.y, self.z)

              def __abs__(self):
                  return (self.x ** 2 + self.y ** 2 + self.z ** 2) ** 0.5

              def __bool__(self):
                  return bool(abs(self))

              def __add__(self, other):
                  x = self.x + other.x
                  y = self.y + other.y
                  z = self.z + other.z
                  return Vector(x, y, z)

              def __mul__(self, scalar):
                  return Vector(self.x * scalar,
                                self.y * scalar,
                                self.z * scalar)

              def __len__(self):
                  return 3

              def __getitem__(self, i):
                  if i in [0, -3]: return self.x
                  elif i in [1, -2]: return self.y
                  elif i in [2, -1]: return self.z
                  else: raise IndexError('Index out of range.')

              def __iter__(self):
                  for i in range(len(self)):
                      yield self[i]
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Conclusion 


This chapter introduces notions and approaches from object-oriented programming, 
both theoretically and through Python examples. OOP is one of the main program- 
ming paradigms used in Python. It not only allows for the modeling and implemen- 
tation of rather complex applications, but also allows one to create custom objects 
that behave like standard Python objects due to the flexible Python data model. 
Although there are many critics who argue against OOP, it is safe to say that it pro- 
vides the Python programmer and quant with powerful tools that are helpful when a 
certain degree of complexity is reached. The derivatives pricing package developed 
and discussed in Part V presents such a case where OOP seems the only sensible pro- 
gramming paradigm to deal with the inherent complexities and requirements for 
abstraction. 


Further Resources 


The following are valuable online resources about OOP in general and Python pro- 
gramming and OOP in particular: 


• Lecture Notes on Object-Oriented Programming
• Object-Oriented Programming in Python

A great resource in book form about Python object orientation and the Python data
model is:

• Ramalho, Luciano (2016). Fluent Python. Sebastopol, CA: O'Reilly.
PART III


Financial Data Science 


This part of the book is about basic techniques, approaches, and packages for finan- 
cial data science. Many topics (such as visualization) and many packages (such as 
scikit-learn) are fundamental for data science with Python. In that sense, this part 
equips the quants and financial analysts with the Python tools they need to become 
financial data scientists. 


Like in Part II, the chapters are organized according to topics such that they can each 
be used as a reference for the topic of interest: 


• Chapter 7 discusses static and interactive visualization with matplotlib and plotly.
• Chapter 8 is about handling financial time series data with pandas.
• Chapter 9 focuses on getting input/output (I/O) operations right and fast.
• Chapter 10 is all about making Python code fast.
• Chapter 11 focuses on frequently required mathematical tools in finance.
• Chapter 12 looks at using Python to implement methods from stochastics.
• Chapter 13 is about statistical and machine learning approaches.


CHAPTER 7 
Data Visualization 


Use a picture. It’s worth a thousand words. 


—Arthur Brisbane (1911) 


This chapter is about the basic visualization capabilities of the matplotlib and 
plotly packages. 


Although there are more visualization packages available, matplotlib has established 
itself as the benchmark and, in many situations, a robust and reliable visualization 
tool. It is both easy to use for standard plots and flexible when it comes to more com- 
plex plots and customizations. In addition, it is tightly integrated with NumPy and 
pandas and the data structures they provide. 


matplotlib produces static plots, typically bitmaps (for example, in PNG or JPG
format), although vector formats such as PDF and SVG are supported as well. On the
other hand, modern web technologies, based for example on the Data-Driven Documents
(D3.js) standard, allow for interactive and embeddable plots (interactive, for example,
in that one can zoom in to inspect certain areas in greater detail). A package that
makes it convenient to create such D3.js plots with Python is plotly. A smaller
additional library, called Cufflinks, tightly integrates plotly with pandas DataFrame
objects and allows for the creation of popular financial plots (such as candlestick
charts).


This chapter mainly covers the following topics: 


“Static 2D Plotting”
This section introduces matplotlib and presents a selection of typical 2D plots,
from the most simple to some more advanced ones with two scales or different
subplots.

“Static 3D Plotting”
Based on matplotlib, a selection of 3D plots useful for certain financial applications
is presented in this section.

“Interactive 2D Plotting”
This section introduces plotly and Cufflinks to create interactive 2D plots.
Making use of the QuantFigure feature of Cufflinks, this section is also about
typical financial plots used, for example, in technical stock analysis.


This chapter cannot be comprehensive with regard to data visualization with Python, 
matplotlib, or plotly, but it provides a number of examples for the basic and 
important capabilities of these packages for finance. Other examples are also found in 
later chapters. For instance, Chapter 8 shows in more depth how to visualize financial 
time series data with the pandas library. 


Static 2D Plotting 


Before creating the sample data and starting to plot, some imports and 
customizations: 


In [1]: import matplotlib as mpl  (1)

In [2]: mpl.__version__  (2)
Out[2]: '3.0.0'

In [3]: import matplotlib.pyplot as plt  (3)

In [4]: plt.style.use('seaborn')  (4)

In [5]: mpl.rcParams['font.family'] = 'serif'  (5)

In [6]: %matplotlib inline

(1) Imports matplotlib with the usual abbreviation mpl.
(2) The version of matplotlib used.
(3) Imports the main plotting (sub)package with the usual abbreviation plt.
(4) Sets the plotting style to seaborn.
(5) Sets the font to be serif in all plots.
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One-Dimensional Data Sets 


The most fundamental, but nevertheless quite powerful, plotting function is 
plt.plot(). In principle, it needs two sets of numbers: 


x values 
A list or an array containing the x coordinates (values of the abscissa) 


y values 
A list or an array containing the y coordinates (values of the ordinate) 


The number of x and y values provided must match, of course. Consider the follow- 
ing code, whose output is presented in Figure 7-1: 


In [7]: import numpy as np

In [8]: np.random.seed(1000)  (1)

In [9]: y = np.random.standard_normal(20)  (2)

In [10]: x = np.arange(len(y))  (3)
         plt.plot(x, y);  (4)

(1) Fixes the seed for the random number generator for reproducibility.
(2) Draws the random numbers (y values).
(3) Fixes the integers (x values).
(4) Calls the plt.plot() function with the x and y objects.
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Figure 7-1. Plot given x and y values 


plt.plot() notices when an ndarray object is passed. In this case, there is no need to 
provide the “extra” information of the x values. If one only provides the y values, 
plt.plot() takes the index values as the respective x values. Therefore, the following 
single line of code generates exactly the same output (see Figure 7-2): 


In [11]: plt.plot(y); 
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Figure 7-2. Plot given data as an ndarray object 


NumPy Arrays and matplotlib 


You can simply pass NumPy ndarray objects to matplotlib functions.
matplotlib is able to interpret the data structures for simplified
plotting. However, be careful not to pass an overly large and/or
complex array.
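That the x values default to the index values can be verified on the line objects that matplotlib creates. The following sketch assumes a non-interactive backend (Agg) so that it runs without a display; the names line_xy, line_y, and same_x are illustrative only:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(1000)
y = np.random.standard_normal(20)

fig, ax = plt.subplots()
line_xy, = ax.plot(np.arange(len(y)), y)  # x values given explicitly
line_y, = ax.plot(y)                      # x values omitted

# Both line objects carry the same x data: the index values 0, 1, ..., 19.
same_x = np.array_equal(line_xy.get_xdata(), line_y.get_xdata())
print(same_x)
```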


Since the majority of the ndarray methods return an ndarray object, one can also 
pass the object with a method (or even multiple methods, in some cases) attached. By 
calling the cumsum() method on the ndarray object with the sample data, one gets the
cumulative sum of this data and, as is to be expected, a different output (see Figure 7-3):


In [12]: plt.plot(y.cumsum()); 




Figure 7-3. Plot given an ndarray object with a method attached 


In general, the default plotting style does not satisfy typical requirements for reports, 
publications, etc. For example, one might want to customize the font used (e.g., for 
compatibility with LaTeX fonts), to have labels at the axes, or to plot a grid for better 
readability. This is where plotting styles come into play. In addition, matplotlib 
offers a large number of functions to customize the plotting style. Some are easily 
accessible; for others one has to dig a bit deeper. Easily accessible, for example, are 
those functions that manipulate the axes and those that relate to grids and labels (see 
Figure 7-4): 

In [13]: plt.plot(y.cumsum())
         plt.grid(False)  (1)
         plt.axis('equal');  (2)

(1) Turns off the grid.
(2) Leads to equal scaling for the two axes.
Figure 7-4. Plot without grid 


Other options for plt.axis() are given in Table 7-1, the majority of which have to 
be passed as a str object. 


Table 7-1. Options for plt.axis() 


Parameter                   Description
Empty                       Returns current axis limits
off                         Turns axis lines and labels off
equal                       Leads to equal scaling
scaled                      Produces equal scaling via dimension changes
tight                       Makes all data visible (tightens limits)
image                       Makes all data visible (with data limits)
[xmin, xmax, ymin, ymax]    Sets limits to given (list of) values


In addition, one can directly set the minimum and maximum values of each axis by 
using plt.xlim() and plt.ylim(). The following code provides an example whose 
output is shown in Figure 7-5: 


In [14]: plt.plot(y.cumsum()) 
plt.xlim(-1, 20) 
plt.ylim(np.min(y.cumsum()) - 1, 

np.max(y.cumsum()) + 1); 


Figure 7-5. Plot with custom axis limits 


For the sake of better readability, a plot usually contains a number of labels—e.g., a 
title and labels describing the nature of the x and y values. These are added by the 
functions plt.title(), plt.xlabel(), and plt.ylabel(), respectively. By default, 
plot() plots continuous lines, even if discrete data points are provided. The plotting 
of discrete points is accomplished by choosing a different style option. Figure 7-6 
overlays (red) points and a (blue) line with line width of 1.5 points: 

In [15]: plt.figure(figsize=(10, 6))  (1)
         plt.plot(y.cumsum(), 'b', lw=1.5)  (2)
         plt.plot(y.cumsum(), 'ro')  (3)
         plt.xlabel('index')  (4)
         plt.ylabel('value')  (5)
         plt.title('A Simple Plot');  (6)

(1) Increases the size of the figure.
(2) Plots the data as a line in blue with line width of 1.5 points.
(3) Plots the data as red (thick) dots.
(4) Places a label on the x-axis.
(5) Places a label on the y-axis.
(6) Places a title.

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Figure 7-6. Plot with typical labels 
By default, plt.plot() supports the color abbreviations in Table 7-2. 


Table 7-2. Standard color abbreviations 


Character Color 


b Blue 

g Green 

r Red 

c Cyan 

m Magenta 
y Yellow 
k Black 

w White 


In terms of line and/or point styles, plt.plot() supports the characters listed in
Table 7-3.
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Table 7-3. Standard style characters 
Character   Symbol
-           Solid line style
--          Dashed line style
-.          Dash-dot line style
:           Dotted line style
.           Point marker
,           Pixel marker
o           Circle marker
v           Triangle_down marker
^           Triangle_up marker
<           Triangle_left marker
>           Triangle_right marker
1           Tri_down marker
2           Tri_up marker
3           Tri_left marker
4           Tri_right marker
s           Square marker
p           Pentagon marker
*           Star marker
h           Hexagon1 marker
H           Hexagon2 marker
+           Plus marker
x           X marker
D           Diamond marker
d           Thin diamond marker
|           Vline marker
_           Hline marker


Any color abbreviation can be combined with any style character. In this way, one 
can make sure that different data sets are easily distinguished. The plotting style is 
also reflected in the legend. 
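The combination of color abbreviations and style characters can be sketched as follows. The data values, labels, and variable names are made up for illustration, and a non-interactive backend (Agg) is assumed so that the code runs without a display:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgba

data = [1, 3, 2, 4]  # illustrative values only

fig, ax = plt.subplots()
# 'r--': red color combined with the dashed line style
dashed_red, = ax.plot(data, 'r--', label='red, dashed line')
# 'g^': green color combined with the triangle_up marker (no line)
green_up, = ax.plot(data, 'g^', label='green, triangle_up marker')
# 'ko-.': black color, circle marker, and dash-dot line style
black_both, = ax.plot(data, 'ko-.', label='black, circle marker, dash-dot line')
ax.legend(loc=0)  # the legend shows each style next to its label

print(dashed_red.get_linestyle(), green_up.get_marker())
```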


Two-Dimensional Data Sets 


Plotting one-dimensional data can be considered a special case. In general, data sets 
will consist of multiple separate subsets of data. The handling of such data sets fol- 
lows the same rules with matplotlib as with one-dimensional data. However, a 
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number of additional issues might arise in such a context. For example, two data sets 
might have such a different scaling that they cannot be plotted using the same y- 
and/or x-axis scaling. Another issue might be that one might want to visualize two 
different data sets in different ways, e.g., one by a line plot and the other by a bar plot. 


The following code generates a two-dimensional sample data set as a NumPy ndarray 
object of shape 20 x 2 with standard normally distributed pseudo-random numbers. 
On this array, the method cumsum() is called to calculate the cumulative sum of the 
sample data along axis 0 (i.e., the first dimension): 


In [16]: y = np.random.standard_normal((20, 2)).cumsum(axis=0) 


In general, one can also pass such two-dimensional arrays to plt.plot(). It will then 
automatically interpret the contained data as separate data sets (along axis 1, i.e., the 
second dimension). A respective plot is shown in Figure 7-7: 


In [17]: plt.figure(figsize=(10, 6)) 
plt.plot(y, lw=1.5) 
plt.plot(y, 'ro') 
plt.xlabel('index') 
plt.ylabel('value') 
plt.title('A Simple Plot'); 




Figure 7-7. Plot with two data sets 


In such a case, further annotations might be helpful to better read the plot. You can 
add individual labels to each data set and have them listed in the legend. The function 
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plt.legend() accepts different locality parameters. 0 stands for best location, in the 
sense that as little data as possible is hidden by the legend. 


Figure 7-8 shows the plot of the two data sets, this time with a legend. In the generat- 
ing code, the ndarray object is not passed as a whole but the two data subsets (y[:, 
0] and y[:, 1]) are accessed separately, which allows you to attach individual labels 
to them: 


In [18]: plt.figure(figsize=(10, 6))
         plt.plot(y[:, 0], lw=1.5, label='1st')  (1)
         plt.plot(y[:, 1], lw=1.5, label='2nd')  (1)
         plt.plot(y, 'ro')
         plt.legend(loc=0)  (2)
         plt.xlabel('index')
         plt.ylabel('value')
         plt.title('A Simple Plot');

(1) Defines labels for the data subsets.
(2) Places a legend in the “best” location.




Figure 7-8. Plot with labeled data sets 


Further location options for plt.legend() include those presented in Table 7-4.
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Table 7-4. Options for plt.legend() 


Loc       Description
Default   Upper right
0         Best possible
1         Upper right
2         Upper left
3         Lower left
4         Lower right
5         Right
6         Center left
7         Center right
8         Lower center
9         Upper center
10        Center


Multiple data sets with a similar scaling, like simulated paths for the same financial 
risk factor, can be plotted using a single y-axis. However, often data sets show rather 
different scalings and the plotting of such data with a single y-scale generally leads to 
a significant loss of visual information. To illustrate the effect, the following example 
scales the first of the two data subsets by a factor of 100 and plots the data again (see 
Figure 7-9): 


In [19]: y[:, 0] = y[:, 0] * 100  (1)

In [20]: plt.figure(figsize=(10, 6))
         plt.plot(y[:, 0], lw=1.5, label='1st')
         plt.plot(y[:, 1], lw=1.5, label='2nd')
         plt.plot(y, 'ro')
         plt.legend(loc=0)
         plt.xlabel('index')
         plt.ylabel('value')
         plt.title('A Simple Plot');

(1) Rescales the first data subset.
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Figure 7-9. Plot with two differently scaled data sets 


Inspection of Figure 7-9 reveals that the first data set is still “visually readable,” while 
the second data set now looks like a straight line with the new scaling of the y-axis. In 
a sense, information about the second data set now gets “visually lost.” There are two 
basic approaches to resolve this problem through means of plotting, as opposed to
adjusting the data (e.g., through rescaling):

• Use of two y-axes (left/right)
• Use of two subplots (upper/lower, left/right)

The following example introduces a second y-axis to the plot. Figure 7-10 now has
two different y-axes. The left y-axis is for the first data set while the right y-axis is for
the second. Consequently, there are also two legends:


In [21]: fig, ax1 = plt.subplots()  (1)
         plt.plot(y[:, 0], 'b', lw=1.5, label='1st')
         plt.plot(y[:, 0], 'ro')
         plt.legend(loc=8)
         plt.xlabel('index')
         plt.ylabel('value 1st')
         plt.title('A Simple Plot')
         ax2 = ax1.twinx()  (2)
         plt.plot(y[:, 1], 'g', lw=1.5, label='2nd')
         plt.plot(y[:, 1], 'ro')
         plt.legend(loc=0)
         plt.ylabel('value 2nd');

(1) Defines the figure and axis objects.
(2) Creates a second axis object that shares the x-axis.
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Figure 7-10. Plot with two data sets and two y-axes 


The key lines of code are those that help manage the axes: 


fig, ax1 = plt.subplots() 
ax2 = ax1.twinx() 


By using the plt.subplots() function, one gets direct access to the underlying plot- 
ting objects (the figure, subplots, etc.). It allows one, for example, to generate a sec- 
ond subplot that shares the x-axis with the first subplot. In Figure 7-10, then, the two 
subplots actually overlay each other. 
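The same plot can also be built by calling methods on the two axis objects directly instead of relying on the plt functions, which always act on the most recently created axis object. The following is a sketch of that variant, not the approach used in the original example; it regenerates comparable sample data and assumes a non-interactive backend (Agg):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(1000)
y = np.random.standard_normal((20, 2)).cumsum(axis=0)
y[:, 0] = y[:, 0] * 100  # rescales the first data subset

fig, ax1 = plt.subplots(figsize=(10, 6))
ax1.plot(y[:, 0], 'b', lw=1.5, label='1st')
ax1.plot(y[:, 0], 'ro')
ax1.legend(loc=8)
ax1.set_xlabel('index')
ax1.set_ylabel('value 1st')
ax1.set_title('A Simple Plot')

ax2 = ax1.twinx()  # second axis object sharing the x-axis
ax2.plot(y[:, 1], 'g', lw=1.5, label='2nd')
ax2.plot(y[:, 1], 'ro')
ax2.legend(loc=0)
ax2.set_ylabel('value 2nd')
```

Addressing ax1 and ax2 explicitly makes it unambiguous which y-axis a given plotting command refers to, which helps once a figure contains more than one axis object.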


Next, consider the case of two separate subplots. This option gives even more freedom
to handle the two data sets, as Figure 7-11 illustrates:


In [22]: plt.figure(figsize=(10, 6))
         plt.subplot(211)  # (1)
         plt.plot(y[:, 0], lw=1.5, label='1st')
         plt.plot(y[:, 0], 'ro')
         plt.legend(loc=0)
         plt.ylabel('value')
         plt.title('A Simple Plot')
         plt.subplot(212)  # (2)
         plt.plot(y[:, 1], 'g', lw=1.5, label='2nd')
         plt.plot(y[:, 1], 'ro')
         plt.legend(loc=0)
         plt.xlabel('index')
         plt.ylabel('value');


(1) Defines the upper subplot 1.

(2) Defines the lower subplot 2.


Figure 7-11. Plot with two subplots


The placing of subplots in a matplotlib figure object is accomplished by the use of a
special coordinate system. plt.subplot() takes as arguments three integers for
numrows, numcols, and fignum (either separated by commas or not). numrows specifies
the number of rows, numcols the number of columns, and fignum the number of
the subplot, counted row by row, starting with 1 and ending with numrows * numcols.
For example, a figure with nine equally sized subplots would have numrows=3,
numcols=3, and fignum=1,2,...,9. The lower-right subplot would have the following
“coordinates”: plt.subplot(3, 3, 9).
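
The numbering scheme can be made concrete with a small sketch. The helper fignum() below is hypothetical (it is not part of matplotlib); it merely maps a zero-based row and column position to the subplot number that plt.subplot() expects, assuming the row-by-row counting described above.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

numrows, numcols = 3, 3


def fignum(row, col):
    """Hypothetical helper: subplot number for a zero-based (row, col) position."""
    return row * numcols + col + 1


fig = plt.figure(figsize=(10, 6))
for row in range(numrows):
    for col in range(numcols):
        ax = plt.subplot(numrows, numcols, fignum(row, col))
        ax.set_title('subplot {}'.format(fignum(row, col)))

# selecting an existing position again makes that subplot the current one
lower_right = plt.subplot(3, 3, 9)
```

The upper-left subplot has number 1, the upper-right one number 3, and the lower-right one number 9.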


Sometimes, it might be necessary or desired to choose two different plot types to
visualize such data. With the subplot approach one has the freedom to combine
arbitrary kinds of plots that matplotlib offers.¹


¹ For an overview of which plot types are available, visit the matplotlib gallery.


Figure 7-12 combines a line/point plot with a bar chart:


In [23]: plt.figure(figsize=(10, 6))
         plt.subplot(121)
         plt.plot(y[:, 0], lw=1.5, label='1st')
         plt.plot(y[:, 0], 'ro')
         plt.legend(loc=0)
         plt.xlabel('index')
         plt.ylabel('value')
         plt.title('1st Data Set')
         plt.subplot(122)
         plt.bar(np.arange(len(y)), y[:, 1], width=0.5,
                 color='g', label='2nd')  # (1)
         plt.legend(loc=0)
         plt.xlabel('index')
         plt.title('2nd Data Set');


(1) Creates a bar subplot.


Figure 7-12. Plot combining line/point subplot with bar subplot


Other Plot Styles 


When it comes to two-dimensional plotting, line and point plots are probably the
most important ones in finance; this is because many data sets embody time series
data, which generally is visualized by such plots. Chapter 8 addresses financial time
series data in detail. However, this section sticks with a two-dimensional data set of
random numbers and illustrates some alternative, and for financial applications
useful, visualization approaches.


The first is the scatter plot, where the values of one data set serve as the x values for
the other data set. Figure 7-13 shows such a plot. This plot type might be used, for
example, for plotting the returns of one financial time series against those of another
one. This example uses a new two-dimensional data set with some more data:


In [24]: y = np.random.standard_normal((1000, 2))  # (1)


In [25]: plt.figure(figsize=(10, 6))
         plt.plot(y[:, 0], y[:, 1], 'ro')  # (2)
         plt.xlabel('1st')
         plt.ylabel('2nd')
         plt.title('Scatter Plot');


(1) Creates a larger data set with random numbers.

(2) Scatter plot produced via the plt.plot() function.


Figure 7-13. Scatter plot via plt.plot() function


matplotlib also provides a specific function to generate scatter plots. It basically
works in the same way, but provides some additional features. Figure 7-14 shows the
corresponding scatter plot to Figure 7-13, this time generated using the plt.scatter()
function:


In [26]: plt.figure(figsize=(10, 6))
         plt.scatter(y[:, 0], y[:, 1], marker='o')  # (1)
         plt.xlabel('1st')
         plt.ylabel('2nd')
         plt.title('Scatter Plot');


(1) Scatter plot produced via the plt.scatter() function.


Figure 7-14. Scatter plot via plt.scatter() function


Among other things, the plt.scatter() plotting function allows the addition of a
third dimension, which can be visualized through different colors and be described
by the use of a color bar. Figure 7-15 shows a scatter plot where there is a third
dimension illustrated by different colors of the single dots and with a color bar as a
legend for the colors. To this end, the following code generates a third data set with
random data, this time consisting of integers from 0 to 9 (the upper bound of 10
passed to np.random.randint() is exclusive):


In [27]: c = np.random.randint(0, 10, len(y))


In [28]: plt.figure(figsize=(10, 6))
         plt.scatter(y[:, 0], y[:, 1],
                     c=c,  # (1)
                     cmap='coolwarm',  # (2)
                     marker='o')  # (3)
         plt.colorbar()
         plt.xlabel('1st')
         plt.ylabel('2nd')
         plt.title('Scatter Plot');


(1) Includes the third data set.

(2) Chooses the color map.

(3) Defines the marker to be a thick dot.


Figure 7-15. Scatter plot with third dimension


Another type of plot, the histogram, is also often used in the context of financial 
returns. Figure 7-16 puts the frequency values of the two data sets next to each other 
in the same plot: 


In [29]: plt.figure(figsize=(10, 6))
         plt.hist(y, label=['1st', '2nd'], bins=25)  # (1)
         plt.legend(loc=0)
         plt.xlabel('value')
         plt.ylabel('frequency')
         plt.title('Histogram');


(1) Histogram plot produced via the plt.hist() function.


Figure 7-16. Histogram for two data sets


Since the histogram is such an important plot type for financial applications, let’s
take a closer look at the use of plt.hist(). The following example illustrates the
parameters that are supported:


plt.hist(x, bins=10, range=None, normed=False, weights=None, cumulative=False,
         bottom=None, histtype='bar', align='mid', orientation='vertical', rwidth=None,
         log=False, color=None, label=None, stacked=False, hold=None, **kwargs)


Table 7-5 provides a description of the main parameters of the plt.hist() function.


Table 7-5. Parameters for plt.hist()

Parameter     Description
x             List object(s), ndarray object
bins          Number of bins
range         Lower and upper range of bins
normed        Norming such that integral value is 1
weights       Weights for every value in x
cumulative    Every bin contains the counts of the lower bins
histtype      Options (strings): bar, barstacked, step, stepfilled
align         Options (strings): left, mid, right
orientation   Options (strings): horizontal, vertical
rwidth        Relative width of the bars
log           Log scale
color         Color per data set (array-like)
label         String or sequence of strings for labels
stacked       Stacks multiple data sets
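
Two of these parameters can be checked numerically. One caveat, stated as an assumption about the installed version: in more recent matplotlib releases the normed parameter is, to the best of my knowledge, no longer accepted and density takes over its role, so the sketch below uses density. The random data set and the seed are made up for illustration.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(100)  # assumed seed, for reproducibility only
data = rng.standard_normal(1000)

fig, ax = plt.subplots(figsize=(10, 6))

# density=True rescales the bar heights so that the bars integrate to 1
heights, edges, patches = ax.hist(data, bins=25, density=True)
area = (heights * np.diff(edges)).sum()

# cumulative=True: every bin holds its own count plus the counts of all lower bins
counts, _, _ = ax.hist(data, bins=25, cumulative=True, histtype='step')
```

Here area equals 1 up to floating-point error, and the last entry of counts equals the number of observations.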


Figure 7-17 shows a similar plot; this time, the data of the two data sets is stacked in 
the histogram: 


In [30]: plt.figure(figsize=(10, 6))
         plt.hist(y, label=['1st', '2nd'], color=['b', 'g'],
                  stacked=True, bins=20, alpha=0.5)
         plt.legend(loc=0)
         plt.xlabel('value')
         plt.ylabel('frequency')
         plt.title('Histogram');


Figure 7-17. Stacked histogram for two data sets


Another useful plot type is the boxplot. Similar to the histogram, the boxplot allows 
both a concise overview of the characteristics of a data set and easy comparison of 
multiple data sets. Figure 7-18 shows such a plot for our data sets: 


In [31]: fig, ax = plt.subplots(figsize=(10, 6))
         plt.boxplot(y)  # (1)
         plt.setp(ax, xticklabels=['1st', '2nd'])  # (2)
         plt.xlabel('data set')
         plt.ylabel('value')
         plt.title('Boxplot');


(1) Boxplot produced via the plt.boxplot() function.

(2) Sets individual x labels.


Figure 7-18. Boxplot for two data sets


This last example uses the function plt.setp(), which sets properties for a (set of)
plotting instance(s). For example, consider a line plot generated by:

line = plt.plot(data, 'r')

The following code changes the style of the line to “dashed”:

plt.setp(line, linestyle='--')

This way, one can easily change parameters after the plotting instance (“artist
object”) has been generated.
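
A self-contained version of this snippet might look as follows. The data values are made up for illustration; the sketch also shows that plt.setp() accepts several properties at once and that the same change can be made through the setter methods of the artist object itself.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

data = np.arange(10) ** 2  # illustrative data only

fig, ax = plt.subplots()
line = plt.plot(data, 'r')  # a list holding a single Line2D object

# several properties can be changed in one call
plt.setp(line, linestyle='--', linewidth=2.5)

# equivalent route: setter methods on the artist object
line[0].set_marker('o')
```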


As a final illustration in this section, consider a mathematically inspired plot that can
also be found as an example in the matplotlib gallery. It plots a function and
highlights graphically the area below the function from a lower to an upper limit; in
other words, the integral value of the function between the lower and upper limits.
The integral (value) to be illustrated is $\int_a^b f(x)\,dx$ with
$f(x) = \frac{1}{2} e^x + 1$, $a = \frac{1}{2}$, and $b = \frac{3}{2}$. Figure 7-19
shows the resulting plot and demonstrates that matplotlib seamlessly handles LaTeX
typesetting for the inclusion of mathematical formulae into plots. First, the function
definition, with integral limits as variables and data sets for the x and y values:


In [32]: def func(x):
             return 0.5 * np.exp(x) + 1  # (1)
         a, b = 0.5, 1.5  # (2)
         x = np.linspace(0, 2)  # (3)
         y = func(x)  # (4)
         Ix = np.linspace(a, b)  # (5)
         Iy = func(Ix)  # (6)
         verts = [(a, 0)] + list(zip(Ix, Iy)) + [(b, 0)]  # (7)


(1) The function definition.

(2) The integral limits.

(3) The x values to plot the function.

(4) The y values to plot the function.

(5) The x values within the integral limits.

(6) The y values within the integral limits.

(7) The list object with multiple tuple objects representing coordinates for the
polygon to be plotted.
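
As a side note, the value of the highlighted integral can be worked out directly, since an antiderivative of the function is 0.5 * exp(x) + x. The following sketch compares this closed form with a simple trapezoidal approximation; the grid size of 10,001 points is an arbitrary choice made for this illustration.

```python
import numpy as np


def func(x):
    return 0.5 * np.exp(x) + 1


a, b = 0.5, 1.5

# closed form: F(x) = 0.5 * exp(x) + x is an antiderivative of func
exact = (0.5 * np.exp(b) + b) - (0.5 * np.exp(a) + a)

# trapezoidal rule on a fine grid as an independent numerical check
Ix = np.linspace(a, b, 10001)
Iy = func(Ix)
approx = ((Iy[1:] + Iy[:-1]) / 2 * np.diff(Ix)).sum()

print(round(exact, 4), round(approx, 4))
```

Both values agree to several decimal places (roughly 2.4165), which is the area of the gray polygon in the plot.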


Second, the plotting itself, which is a bit involved due to the many single objects to be
placed explicitly:


In [33]: from matplotlib.patches import Polygon
         fig, ax = plt.subplots(figsize=(10, 6))
         plt.plot(x, y, 'b', linewidth=2)  # (1)
         plt.ylim(bottom=0)  # (2)
         poly = Polygon(verts, facecolor='0.7', edgecolor='0.5')  # (3)
         ax.add_patch(poly)  # (3)
         plt.text(0.5 * (a + b), 1, r'$\int_a^b f(x)\mathrm{d}x$',
                  horizontalalignment='center', fontsize=20)  # (4)
         plt.figtext(0.9, 0.075, '$x$')  # (5)
         plt.figtext(0.075, 0.9, '$f(x)$')  # (5)
         ax.set_xticks((a, b))  # (6)
         ax.set_xticklabels(('$a$', '$b$'))  # (6)
         ax.set_yticks([func(a), func(b)])  # (7)
         ax.set_yticklabels(('$f(a)$', '$f(b)$'));  # (7)


(1) Plots the function values as a blue line.

(2) Defines the minimum y value for the ordinate axis.

(3) Plots the polygon (integral area) in gray.

(4) Places the integral formula in the plot.

(5) Places the axis labels.

(6) Places the x labels.

(7) Places the y labels.


Figure 7-19. Exponential function, integral area, and LaTeX labels
Static 3D Plotting


There are not too many fields in finance that really benefit from visualization in three
dimensions. However, one application area is volatility surfaces showing implied
volatilities simultaneously for a number of times-to-maturity and strikes of the traded
options used. See also Appendix B for an example of value and vega surfaces being
visualized for a European call option. In what follows, the code artificially generates a
plot that resembles a volatility surface. To this end, consider the parameters:


• Strike values between 50 and 150
• Times-to-maturity between 0.5 and 2.5 years


This provides a two-dimensional coordinate system. The NumPy np.meshgrid()
function can generate such a system out of two one-dimensional ndarray objects:


In [34]: strike = np.linspace(50, 150, 24)  # (1)

In [35]: ttm = np.linspace(0.5, 2.5, 24)  # (2)

In [36]: strike, ttm = np.meshgrid(strike, ttm)  # (3)

In [37]: strike[:2].round(1)  # (3)
Out[37]: array([[ 50. ,  54.3,  58.7,  63. ,  67.4,  71.7,  76.1,  80.4,  84.8,
                  89.1,  93.5,  97.8, 102.2, 106.5, 110.9, 115.2, 119.6, 123.9,
                 128.3, 132.6, 137. , 141.3, 145.7, 150. ],
                [ 50. ,  54.3,  58.7,  63. ,  67.4,  71.7,  76.1,  80.4,  84.8,
                  89.1,  93.5,  97.8, 102.2, 106.5, 110.9, 115.2, 119.6, 123.9,
                 128.3, 132.6, 137. , 141.3, 145.7, 150. ]])

In [38]: iv = (strike - 100) ** 2 / (100 * strike) / ttm  # (4)

In [39]: iv[:5, :3]  # (4)
Out[39]: array([[1.        , 0.76695652, 0.58132045],
                [0.85185185, 0.65333333, 0.4951989 ],
                [0.74193548, 0.56903226, 0.43130227],
                [0.65714286, 0.504     , 0.38201058],
                [0.58974359, 0.45230769, 0.34283001]])


(1) The ndarray object with the strike values.

(2) The ndarray object with the time-to-maturity values.

(3) The two two-dimensional ndarray objects (grids) created.

(4) The dummy implied volatility values.
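
What np.meshgrid() does is easier to see on a much smaller grid. The following sketch uses only five strikes and three maturities (values chosen for illustration) and the names K and T for the two grids, which are not used elsewhere in this chapter.

```python
import numpy as np

strike = np.linspace(50, 150, 5)   # 50, 75, 100, 125, 150
ttm = np.linspace(0.5, 2.5, 3)     # 0.5, 1.5, 2.5

# both grids have shape (len(ttm), len(strike))
K, T = np.meshgrid(strike, ttm)

# every row of K repeats the strikes; every column of T repeats the maturities
iv = (K - 100) ** 2 / (100 * K) / T

print(K.shape, T.shape)
print(iv.round(3))
```

Each element iv[i, j] then belongs to the pair (strike[j], ttm[i]); the column for the strike of 100 contains only zeros because of the (strike - 100) ** 2 term.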
The plot resulting from the following code is shown in Figure 7-20:


In [40]: from mpl_toolkits.mplot3d import Axes3D  # (1)
         fig = plt.figure(figsize=(10, 6))
         ax = fig.gca(projection='3d')  # (2)
         surf = ax.plot_surface(strike, ttm, iv, rstride=2, cstride=2,
                                cmap=plt.cm.coolwarm, linewidth=0.5,
                                antialiased=True)  # (3)
         ax.set_xlabel('strike')  # (4)
         ax.set_ylabel('time-to-maturity')  # (5)
         ax.set_zlabel('implied volatility')  # (6)
         fig.colorbar(surf, shrink=0.5, aspect=5);  # (7)


(1) Imports the relevant 3D plotting features, which is required although Axes3D is
not directly used.

(2) Sets up a canvas for 3D plotting.

(3) Creates the 3D plot.

(4) Sets the x-axis label.

(5) Sets the y-axis label.

(6) Sets the z-axis label.

(7) Creates a color bar.


Figure 7-20. 3D surface plot for (dummy) implied volatilities


Table 7-6 provides a description of the different parameters the ax.plot_surface()
method can take.


Table 7-6. Parameters for plot_surface()

Parameter    Description
X, Y, Z      Data values as 2D arrays
rstride      Array row stride (step size)
cstride      Array column stride (step size)
color        Color of the surface patches
cmap         Color map for the surface patches
facecolors   Face colors for the individual patches
norm         Instance of Normalize to map values to colors
vmin         Minimum value to map
vmax         Maximum value to map
shade        Whether to shade the face colors
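
A version-related remark, offered as an assumption rather than a definitive statement: in more recent matplotlib releases, fig.gca() reportedly no longer accepts the projection keyword, while fig.add_subplot(111, projection='3d') sets up the same kind of 3D axis object. The sketch below re-creates the dummy surface that way; the color map is passed by name, which is only a stylistic choice here.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (needed only on older versions)

strike, ttm = np.meshgrid(np.linspace(50, 150, 24),
                          np.linspace(0.5, 2.5, 24))
iv = (strike - 100) ** 2 / (100 * strike) / ttm  # dummy implied volatilities

fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111, projection='3d')  # alternative to fig.gca(projection='3d')
surf = ax.plot_surface(strike, ttm, iv, rstride=2, cstride=2,
                       cmap='coolwarm', linewidth=0.5, antialiased=True)
ax.set_xlabel('strike')
ax.set_ylabel('time-to-maturity')
ax.set_zlabel('implied volatility')
fig.colorbar(surf, shrink=0.5, aspect=5)
```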


As with two-dimensional plots, the line style can be replaced by single points or, as in
what follows, single triangles. Figure 7-21 plots the same data as a 3D scatter plot but
now also with a different viewing angle, using the view_init() method to set it:


In [41]: fig = plt.figure(figsize=(10, 6))
         ax = fig.add_subplot(111, projection='3d')
         ax.view_init(30, 60)  # (1)
         ax.scatter(strike, ttm, iv, zdir='z', s=25,
                    c='b', marker='^')  # (2)
         ax.set_xlabel('strike')
         ax.set_ylabel('time-to-maturity')
         ax.set_zlabel('implied volatility');


(1) Sets the viewing angle.

(2) Creates a 3D scatter plot.


Figure 7-21. 3D scatter plot for (dummy) implied volatilities
Interactive 2D Plotting


matplotlib allows you to create plots as static bitmap objects or in PDF format.
Nowadays, there are many libraries available to create interactive plots based on the
D3.js JavaScript library. Such plots enable zooming in and out, hover effects for data
inspection, and more. They can in general also be easily embedded in web pages.


A popular platform and plotting library is plotly. It is dedicated to visualization for 
data science and is in widespread use in the data science community. Major benefits 
of plotly are its tight integration with the Python ecosystem and the ease of use—in 
particular when combined with pandas DataFrame objects and the wrapper package 
Cufflinks. 


For some functionality, a free account is required. Once the credentials are granted 
they should be stored locally for permanent use. For details, see the “Getting Started 
with Plotly for Python” guide. 


This section focuses on selected aspects only, in that Cufflinks is used exclusively to 
create interactive plots from data stored in DataFrame objects. 


Basic Plots 


To get started from within a Jupyter Notebook context, some imports are required 
and the notebook mode should be turned on: 


In [42]: import pandas as pd

In [43]: import cufflinks as cf  # (1)

In [44]: import plotly.offline as plyo  # (2)

In [45]: plyo.init_notebook_mode(connected=True)  # (3)


(1) Imports Cufflinks.

(2) Imports the offline plotting capabilities of plotly.

(3) Turns on the notebook plotting mode.


Remote or Local Rendering 


With plotly, there is also the option to get the plots rendered on 
the plotly servers. However, the notebook mode is generally much 
faster, in particular when dealing with larger data sets. That said, 
some functionality, like the streaming plot service of plotly, is 
only available via communication with the server. 
The examples that follow rely again on pseudo-random numbers, this time stored in
a DataFrame object with DatetimeIndex (i.e., as time series data):


In [46]: a = np.random.standard_normal((250, 5)).cumsum(axis=0)  # (1)

In [47]: index = pd.date_range('2019-1-1',  # (2)
                               freq='B',  # (3)
                               periods=len(a))  # (4)

In [48]: df = pd.DataFrame(100 + 5 * a,  # (5)
                           columns=list('abcde'),  # (6)
                           index=index)  # (7)

In [49]: df.head()  # (8)
Out[49]:                      a           b           c          d           e
         2019-01-01  109.037535   98.693865  104.474094  96.878857  100.621936
         2019-01-02  107.598242   97.005738  106.789189  97.966552  100.175313
         2019-01-03  101.639668  100.332253  103.183500  99.747869  107.902901
         2019-01-04   98.500363  101.208283  100.966242  94.023898  104.387256
         2019-01-07   93.941632  103.319168  105.674012  95.891062   86.547934


(1) The standard normally distributed pseudo-random numbers.

(2) The start date for the DatetimeIndex object.

(3) The frequency (“business daily”).

(4) The number of periods needed.

(5) A linear transform of the raw data.

(6) The column headers as single characters.

(7) The DatetimeIndex object.

(8) The first five rows of data.
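
The effect of the “business daily” frequency can be verified on a short index. The sketch below is only an illustration: it uses seven periods, two columns, and NumPy's default_rng generator with an arbitrary seed instead of the data set created above.

```python
import numpy as np
import pandas as pd

index = pd.date_range('2019-1-1', freq='B', periods=7)

# 'B' skips Saturdays and Sundays but knows nothing about public holidays,
# which is why 2019-01-01 itself is part of the index
weekdays = index.dayofweek  # Monday=0, ..., Sunday=6

rng = np.random.default_rng(100)  # assumed seed, for reproducibility only
a = rng.standard_normal((len(index), 2)).cumsum(axis=0)  # simulated paths
df = pd.DataFrame(100 + 5 * a, columns=list('ab'), index=index)

print(list(index.strftime('%Y-%m-%d')))
```

The index jumps from Friday, 2019-01-04, to Monday, 2019-01-07, just as in the output of df.head() above.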


Cufflinks adds a new method to the DataFrame class: df.iplot(). This method uses
plotly in the backend to create interactive plots. The code examples in this section
all make use of the option to download the interactive plot as a static bitmap, which
in turn is embedded in the text. In the Jupyter Notebook environment, the created
plots are all interactive. The result of the following code is shown in Figure 7-22:


In [50]: plyo.iplot(  # (1)
             df.iplot(asFigure=True),  # (2)
             image='png',  # (3)
             filename='ply_01'  # (4)
         )


(1) This makes use of the offline (notebook mode) capabilities of plotly.

(2) The df.iplot() method is called with parameter asFigure=True to allow for
local plotting and embedding.

(3) The image option provides in addition a static bitmap version of the plot.

(4) The filename for the bitmap to be saved is specified (the file type extension is
added automatically).


Figure 7-22. Line plot for time series data with plotly, pandas, and Cufflinks


As with matplotlib in general and with the pandas plotting functionality, there are 
multiple parameters available to customize such plots (see Figure 7-23): 


In [51]: plyo.iplot(
             df[['a', 'b']].iplot(asFigure=True,
                      theme='polar',  # (1)
                      title='A Time Series Plot',  # (2)
                      xTitle='date',  # (3)
                      yTitle='value',  # (4)
                      mode={'a': 'markers', 'b': 'lines+markers'},  # (5)
                      symbol={'a': 'circle', 'b': 'diamond'},  # (6)
                      size=3.5,  # (7)
                      colors={'a': 'blue', 'b': 'magenta'},  # (8)
                                 ),
             image='png',
             filename='ply_02'
         )


(1) Selects a theme (plotting style) for the plot.

(2) Adds a title.

(3) Adds an x-axis label.

(4) Adds a y-axis label.

(5) Defines the plotting mode (line, marker, etc.) by column.

(6) Defines the symbols to be used as markers by column.

(7) Fixes the size for all markers.

(8) Specifies the plotting color by column.


Figure 7-23. Line plot for two columns of the DataFrame object with customizations


Similar to matplotlib, plotly allows for a number of different plotting types.
Plotting types available via Cufflinks are chart, scatter, bar, box, spread, ratio,
heatmap, surface, histogram, bubble, bubble3d, scatter3d, scattergeo, ohlc, candle,
pie, and choropleth. As an example of a plotting type different from a line plot,
consider the histogram (see Figure 7-24):


In [52]: plyo.iplot(
             df.iplot(kind='hist',  # (1)
                      subplots=True,  # (2)
                      bins=15,  # (3)
                      asFigure=True),
             image='png',
             filename='ply_03'
         )


(1) Specifies the plotting type.

(2) Requires separate subplots for every column.

(3) Sets the bins parameter (buckets to be used = bars to be plotted).


Figure 7-24. Histograms per column of the DataFrame object


Financial Plots 


The combination of plotly, Cufflinks, and pandas proves particularly powerful
when working with financial time series data. Cufflinks provides specialized
functionality to create typical financial plots and to add typical financial charting
elements, such as the Relative Strength Index (RSI), to name but one example. To this
end, a persistent QuantFig object is created that can be plotted the same way as a
DataFrame object with Cufflinks.


This subsection uses a real financial data set, time series data for the EUR/USD
exchange rate (source: FXCM Forex Capital Markets Ltd.):


In [54]: raw = pd.read_csv('../../source/fxcm_eur_usd_eod_data.csv',
                           index_col=0, parse_dates=True)  # (1)

In [55]: raw.info()  # (2)
         <class 'pandas.core.frame.DataFrame'>
         DatetimeIndex: 1547 entries, 2013-01-01 22:00:00 to 2017-12-31 22:00:00
         Data columns (total 8 columns):
         BidOpen     1547 non-null float64
         BidHigh     1547 non-null float64
         BidLow      1547 non-null float64
         BidClose    1547 non-null float64
         AskOpen     1547 non-null float64
         AskHigh     1547 non-null float64
         AskLow      1547 non-null float64
         AskClose    1547 non-null float64
         dtypes: float64(8)
         memory usage: 108.8 KB

In [56]: quotes = raw[['AskOpen', 'AskHigh', 'AskLow', 'AskClose']]  # (3)
         quotes = quotes.iloc[-60:]  # (4)
         quotes.tail()  # (5)
Out[56]:                      AskOpen  AskHigh   AskLow  AskClose
         2017-12-25 22:00:00  1.18667  1.18791  1.18467   1.18587
         2017-12-26 22:00:00  1.18587  1.19104  1.18552   1.18885
         2017-12-27 22:00:00  1.18885  1.19592  1.18885   1.19426
         2017-12-28 22:00:00  1.19426  1.20256  1.19369   1.20092
         2017-12-31 22:00:00  1.20092  1.20144  1.19994   1.20144


(1) Reads the financial data from a CSV file.

(2) The resulting DataFrame object consists of multiple columns and more than
1,500 data rows.

(3) Selects four columns from the DataFrame object (Open-High-Low-Close, or
OHLC).

(4) Only a few data rows are used for the visualization.

(5) Returns the final five rows of the resulting data set quotes.


During instantiation, the QuantFig object takes the DataFrame object as input and
allows for some basic customization. Plotting the data stored in the QuantFig object
qf then happens with the qf.iplot() method (see Figure 7-25):


In [57]: qf = cf.QuantFig(
             quotes,  # (1)
             title='EUR/USD Exchange Rate',  # (2)
             legend='top',  # (3)
             name='EUR/USD'  # (4)
         )

In [58]: plyo.iplot(
             qf.iplot(asFigure=True),
             image='png',
             filename='qf_01'
         )


(1) The DataFrame object is passed to the QuantFig constructor.

(2) This adds a figure title.

(3) The legend is placed at the top of the plot.

(4) This gives the data set a name.


Figure 7-25. OHLC plot of EUR/USD data
Adding typical financial charting elements, such as Bollinger bands, is possible via
different methods available for the QuantFig object (see Figure 7-26):


In [59]: qf.add_bollinger_bands(periods=15,  # (1)
                                boll_std=2)  # (2)

In [60]: plyo.iplot(qf.iplot(asFigure=True),
              image='png',
              filename='qf_02'
         )


(1) The number of periods for the Bollinger band.

(2) The number of standard deviations to be used for the band width.


Figure 7-26. OHLC plot of EUR/USD data with Bollinger band
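
To make the two parameters more tangible, the following sketch computes Bollinger bands by hand with pandas rolling statistics. It is an illustration under assumptions: the price series is simulated (it is not the EUR/USD data set), and the exact computation inside Cufflinks may differ in details such as the type of standard deviation used.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(100)  # assumed seed, for reproducibility only
index = pd.date_range('2017-10-01', periods=60, freq='B')
# hypothetical closing prices standing in for the AskClose column
close = pd.Series(1.18 + 0.005 * rng.standard_normal(60).cumsum(), index=index)

periods, boll_std = 15, 2

sma = close.rolling(periods).mean()  # middle band: simple moving average
std = close.rolling(periods).std()   # rolling (sample) standard deviation
upper = sma + boll_std * std         # upper band
lower = sma - boll_std * std         # lower band

bands = pd.DataFrame({'close': close, 'sma': sma,
                      'upper': upper, 'lower': lower})
```

The first periods - 1 rows of the bands are NaN because the rolling window is not yet completely filled; a larger boll_std widens the band.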


Certain financial indicators, such as RSI, may be added as a subplot (see Figure 7-27): 


In [61]: qf.add_rsi(periods=14,  # (1)
                    showbands=False)  # (2)

In [62]: plyo.iplot(
              qf.iplot(asFigure=True),
              image='png',
              filename='qf_03'
         )


(1) Fixes the RSI period.

(2) Does not show an upper or lower band.


Figure 7-27. OHLC plot of EUR/USD data with Bollinger band and RSI
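
For readers unfamiliar with the indicator, here is a minimal sketch of one common RSI variant. It averages gains and losses with simple rolling means, which is an assumption of this sketch; the classic definition uses a smoothed average instead, and which variant Cufflinks implements is not covered here. The function rsi() and the simulated price series are hypothetical.

```python
import numpy as np
import pandas as pd


def rsi(close, periods=14):
    """RSI based on simple rolling means of gains and losses (one of several variants)."""
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(periods).mean()     # average upward move
    loss = (-delta.clip(upper=0)).rolling(periods).mean()  # average downward move
    # equivalent to 100 - 100 / (1 + gain / loss)
    return 100 * gain / (gain + loss)


rng = np.random.default_rng(100)  # assumed seed, for reproducibility only
close = pd.Series(1.18 + 0.005 * rng.standard_normal(60).cumsum())
values = rsi(close, periods=14)

rising = rsi(pd.Series(np.arange(1.0, 31.0)))  # a strictly rising series
```

The indicator is bounded between 0 and 100; a series that only rises has no losses and therefore an RSI of 100.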


Conclusion 


matplotlib can be considered both the benchmark and an all-rounder when it comes
to data visualization in Python. It is tightly integrated with NumPy and pandas, and the
basic functionality is easily and conveniently accessed. However, matplotlib is a
mighty library with a somewhat complex API. This makes it impossible to give a
broad overview of all the capabilities of matplotlib in this chapter.


This chapter introduces the basic functions of matplotlib for 2D and 3D plotting
useful in many financial contexts. Other chapters provide further examples of how to
use the package for visualization.


In addition, this chapter covers plotly in combination with Cufflinks. This
combination makes the creation of interactive D3.js plots a convenient affair since only a
single method call on a DataFrame object is necessary in general. All technicalities are
taken care of in the backend. Furthermore, with the QuantFig object, Cufflinks
provides an easy way to create typical financial plots with popular financial indicators.


Further Resources 


A variety of resources for matplotlib can be found on the web, including: 


• The home page, which is probably the best starting point
• A gallery with many useful examples
• A tutorial for 2D plotting
• A tutorial for 3D plotting


It has become kind of a standard routine to consult the gallery, look there for an
appropriate visualization example, and start with the corresponding example code.


The major resources for the plotly and Cufflinks packages are also online. These
include:

• The plotly home page
• A tutorial to get started with plotly for Python
• The Cufflinks GitHub page
CHAPTER 8
Financial Time Series 


[T]ime is what keeps everything from happening at once. 


—Ray Cummings 


Financial time series data is one of the most important types of data in finance. This 
is data indexed by date and/or time. For example, prices of stocks over time represent 
financial time series data. Similarly, the EUR/USD exchange rate over time represents 
a financial time series; the exchange rate is quoted in brief intervals of time, and a 
collection of such quotes then is a time series of exchange rates. 


There is no financial discipline that gets by without considering time an important
factor. In this respect, finance is much like physics and other sciences. The major tool to
cope with time series data in Python is pandas. Wes McKinney, the original and main
author of pandas, started developing the library when working as an analyst at AQR
Capital Management, a large hedge fund. It is safe to say that pandas has been
designed from the ground up to work with financial time series data.


The chapter is mainly based on two financial time series data sets in the form of 
comma-separated values (CSV) files. It proceeds along the following lines: 


“Financial Data”
This section is about the basics of working with financial time series data using
pandas: data import, deriving summary statistics, calculating changes over time,
and resampling.


“Rolling Statistics”
In financial analysis, rolling statistics play an important role. These are statistics
calculated in general over a fixed time interval that is rolled forward over the
complete data set. A popular example is simple moving averages. This section
illustrates how pandas supports the calculation of such statistics.


“Correlation Analysis”
This section presents a case study based on financial time series data for the S&P
500 stock index and the VIX volatility index. It provides some support for the
stylized (empirical) fact that both indices are negatively correlated.


“High-Frequency Data”
This section works with high-frequency data, or tick data, which has become
commonplace in finance. pandas again proves powerful in handling such data
sets.


Financial Data 


This section works with a locally stored financial data set in the form of a CSV file. 
Technically, such files are simply text files with a data row structure characterized by 
commas that separate single values. Before importing the data, some package imports 
and customizations: 


In [1]: import numpy as np 
import pandas as pd 
from pylab import mpl, plt 
plt.style.use('seaborn') 
mpl.rcParams['font.family'] = 'serif' 
%matplotlib inline 


Data Import 


pandas provides a number of different functions and DataFrame methods to import
data stored in different formats (CSV, SQL, Excel, etc.) and to export data to different
formats (see Chapter 9 for more details). The following code uses the pd.read_csv()
function to import the time series data set from the CSV file:¹


¹ The file contains end-of-day (EOD) data for different financial instruments as retrieved from the Thomson
Reuters Eikon Data API.


In [2]: filename = '../../source/tr_eikon_eod_data.csv'  # (1)

In [3]: f = open(filename, 'r')  # (2)
        f.readlines()[:5]  # (2)
Out[3]: ['Date,AAPL.O,MSFT.O,INTC.O,AMZN.O,GS.N,SPY,.SPX,.VIX,EUR=,XAU=,GDX,GLD\n',
         '2010-01-01,,,,,,,,,1.4323,1096.35,,\n',
         '2010-01-04,30.57282657,30.95,20.88,133.9,173.08,113.33,1132.99,20.04,1.4411,1120.0,47.71,109.8\n',
         '2010-01-05,30.625683660000004,30.96,20.87,134.69,176.14,113.63,1136.52,19.35,1.4368,1118.65,48.17,109.7\n',
         '2010-01-06,30.138541290000003,30.77,20.8,132.25,174.26,113.71,1137.14,19.16,1.4412,1138.5,49.34,111.51\n']

In [4]: data = pd.read_csv(filename,  # (3)
                           index_col=0,  # (4)
                           parse_dates=True)  # (5)

In [5]: data.info()  # (6)
        <class 'pandas.core.frame.DataFrame'>
        DatetimeIndex: 2216 entries, 2010-01-01 to 2018-06-29
        Data columns (total 12 columns):
        AAPL.O    2138 non-null float64
        MSFT.O    2138 non-null float64
        INTC.O    2138 non-null float64
        AMZN.O    2138 non-null float64
        GS.N      2138 non-null float64
        SPY       2138 non-null float64
        .SPX      2138 non-null float64
        .VIX      2138 non-null float64
        EUR=      2216 non-null float64
        XAU=      2211 non-null float64
        GDX       2138 non-null float64
        GLD       2138 non-null float64
        dtypes: float64(12)
        memory usage: 225.1 KB


(1) Specifies the path and filename.

(2) Shows the first five rows of the raw data (Linux/Mac).

(3) The filename passed to the pd.read_csv() function.

(4) Specifies that the first column shall be handled as an index.

(5) Specifies that the index values are of type datetime.

(6) The resulting DataFrame object.
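
The effect of the two keyword arguments can be tried out without the file itself. The following sketch feeds a few rows in the same layout to pd.read_csv() through an in-memory text buffer; the use of io.StringIO and the shortened data are assumptions made only for this illustration.

```python
import io

import pandas as pd

# a header line and two data rows in the layout of the CSV file, held in memory
raw = (
    'Date,AAPL.O,MSFT.O,INTC.O,AMZN.O,GS.N,SPY,.SPX,.VIX,EUR=,XAU=,GDX,GLD\n'
    '2010-01-01,,,,,,,,,1.4323,1096.35,,\n'
    '2010-01-04,30.57282657,30.95,20.88,133.9,173.08,113.33,1132.99,20.04,'
    '1.4411,1120.0,47.71,109.8\n'
)

data = pd.read_csv(io.StringIO(raw),
                   index_col=0,       # first column becomes the index
                   parse_dates=True)  # index values are parsed as datetime

print(type(data.index).__name__)
print(data.isna().sum().sum())
```

Empty fields end up as NaN values, and the index is a DatetimeIndex named after the first column header, Date.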


At this stage, a financial analyst probably takes a first look at the data, either by 
inspecting or visualizing it (see Figure 8-1): 


In [6]: data.head()  # (1)
Out[6]:
               AAPL.O  MSFT.O  INTC.O  AMZN.O    GS.N     SPY     .SPX   .VIX  \
Date
2010-01-01        NaN     NaN     NaN     NaN     NaN     NaN      NaN    NaN
2010-01-04  30.572827  30.950   20.88  133.90  173.08  113.33  1132.99  20.04
2010-01-05  30.625684  30.960   20.87  134.69  176.14  113.63  1136.52  19.35
2010-01-06  30.138541  30.770   20.80  132.25  174.26  113.71  1137.14  19.16
2010-01-07  30.082827  30.452   20.60  130.00  177.67  114.19  1141.69  19.06

              EUR=     XAU=    GDX     GLD
Date
2010-01-01  1.4323  1096.35    NaN     NaN
2010-01-04  1.4411  1120.00  47.71  109.80
2010-01-05  1.4368  1118.65  48.17  109.70
2010-01-06  1.4412  1138.50  49.34  111.51
2010-01-07  1.4318  1131.90  49.10  110.82

In [7]: data.tail()  # (2)
Out[7]:
            AAPL.O  MSFT.O  INTC.O   AMZN.O    GS.N     SPY     .SPX   .VIX  \
Date
2018-06-25  182.17   98.39   50.71  1663.15  221.54  271.00  2717.07  17.33
2018-06-26  184.43   99.08   49.67  1691.09  221.58  271.60  2723.06  15.92
2018-06-27  184.16   97.54   48.76  1660.51  220.18  269.35  2699.63  17.91
2018-06-28  185.50   98.63   49.25  1701.45  223.42  270.89  2716.31  16.85
2018-06-29  185.11   98.61   49.71  1699.80  220.57  272.28  2718.37  16.09

              EUR=     XAU=    GDX     GLD
Date
2018-06-25  1.1702  1265.00  22.01  119.89
2018-06-26  1.1645  1258.64  21.95  119.26
2018-06-27  1.1552  1251.62  21.81  118.58
2018-06-28  1.1567  1247.88  21.93  118.22
2018-06-29  1.1683  1252.25  22.31  118.65

In [8]: data.plot(figsize=(10, 12), subplots=True);  # (3)


(1) The first five rows ...

(2) ... and the final five rows are shown.

(3) This visualizes the complete data set via multiple subplots.


Figure 8-1. Financial time series data as line plots


The data used is from the Thomson Reuters (TR) Eikon Data API. In the TR world,
symbols for financial instruments are called Reuters Instrument Codes (RICs). The
financial instruments that the individual RICs represent are:


In [9]: instruments = ['Apple Stock', 'Microsoft Stock', 
'Intel Stock', 'Amazon Stock', 'Goldman Sachs Stock', 
'SPDR S&P 500 ETF Trust', 'S&P 500 Index', 
'VIX Volatility Index', 'EUR/USD Exchange Rate', 
'Gold Price', 'VanEck Vectors Gold Miners ETF', 
'SPDR Gold Trust'] 


In [10]: for ric, name in zip(data.columns, instruments):
             print('{:8s} | {}'.format(ric, name))
         AAPL.O   | Apple Stock
         MSFT.O   | Microsoft Stock
         INTC.O   | Intel Stock
         AMZN.O   | Amazon Stock
         GS.N     | Goldman Sachs Stock
         SPY      | SPDR S&P 500 ETF Trust
         .SPX     | S&P 500 Index
         .VIX     | VIX Volatility Index
         EUR=     | EUR/USD Exchange Rate
         XAU=     | Gold Price
         GDX      | VanEck Vectors Gold Miners ETF
         GLD      | SPDR Gold Trust


Summary Statistics


The next step the financial analyst might take is to have a look at different summary 
statistics for the data set to get a “feeling” for what it is all about: 


In [11]: data.info()  (1)
         <class 'pandas.core.frame.DataFrame'>
         DatetimeIndex: 2216 entries, 2010-01-01 to 2018-06-29
         Data columns (total 12 columns):
         AAPL.O    2138 non-null float64
         MSFT.O    2138 non-null float64
         INTC.O    2138 non-null float64
         AMZN.O    2138 non-null float64
         GS.N      2138 non-null float64
         SPY       2138 non-null float64
         .SPX      2138 non-null float64
         .VIX      2138 non-null float64
         EUR=      2216 non-null float64
         XAU=      2211 non-null float64
         GDX       2138 non-null float64
         GLD       2138 non-null float64
         dtypes: float64(12)
         memory usage: 225.1 KB

In [12]: data.describe().round(2)  (2)
Out[12]:
        AAPL.O   MSFT.O   INTC.O   AMZN.O     GS.N      SPY     .SPX     .VIX  \
count  2138.00  2138.00  2138.00  2138.00  2138.00  2138.00  2138.00  2138.00
mean     93.46    44.56    29.36   480.46   170.22   180.32  1802.71    17.03
std      40.55    19.53     8.17   372.31    42.48    48.19   483.34     5.88
min      27.44    23.01    17.66   108.61    87.70   102.20  1022.58     9.14
25%      60.29    28.57    22.51   213.60   146.61   133.99  1338.57      ...
50%      90.59    39.66    27.33   322.06   164.43   186.32  1863.08    15.58
75%     117.24    54.37    34.71   698.85   192.13   210.99  2108.94      ...
max     193.98   102.49    57.08  1750.08   273.38   286.58  2872.87    48.00

          EUR=     XAU=      GDX      GLD
count  2216.00  2211.00  2138.00  2138.00
mean      1.25  1349.01    33.57   130.09
std       0.11   188.75    15.17    18.78
min       1.04  1051.36    12.47   100.50
25%       1.13  1221.53    22.14   117.40
50%       1.27  1292.61    25.62   124.00
75%       1.35  1428.24    48.34   139.00
max       1.48  1898.99    66.63   184.59

(1) info() gives some metainformation about the DataFrame object.
(2) describe() provides useful standard statistics per column.


Quick Insights 


pandas provides a number of methods to gain a quick overview
of newly imported financial time series data sets, such as info()
and describe(). They also allow for quick checks of whether the
importing procedure worked as desired (e.g., whether the
DataFrame object indeed has an index of type DatetimeIndex).
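
The checks mentioned in the box can also be done programmatically. The following is a minimal sketch, assuming a hypothetical three-row excerpt (the `csv_text` string) in the same layout as the EOD file; the variable names are illustrative and not part of the original example:

```python
import io

import pandas as pd

# Hypothetical excerpt with the same layout as the EOD file (assumption).
csv_text = """Date,AAPL.O,EUR=
2010-01-01,,1.4323
2010-01-04,30.57282657,1.4411
2010-01-05,30.62568366,1.4368
"""

df = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

# Did the date column become a DatetimeIndex?
index_ok = isinstance(df.index, pd.DatetimeIndex)
# Are all data columns numeric (float64)?
all_float = all(dtype == 'float64' for dtype in df.dtypes)
# How many values are missing per column?
missing = df.isna().sum()

print(index_ok, all_float)
print(missing)
```

If `index_ok` were False, the usual suspects would be a missing `parse_dates=True` or a date format that pandas could not parse.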


There are also options, of course, to customize what types of statistic to derive and
display:

In [13]: data.mean()  (1)
Out[13]: AAPL.O      93.455973
         MSFT.O      44.561115
         INTC.O      29.364192
         AMZN.O     480.461251
         GS.N       170.216221
         SPY        180.323029
         .SPX      1802.713106
         .VIX        17.027133
         EUR=         1.248587
         XAU=      1349.014130
         GDX         33.566525
         GLD        130.086590
         dtype: float64

In [14]: data.aggregate([min,  (2)
                         np.mean,  (3)
                         np.std,  (4)
                         np.median,  (5)
                         max]  (6)
         ).round(2)
Out[14]:
        AAPL.O  MSFT.O  INTC.O   AMZN.O    GS.N     SPY     .SPX   .VIX  EUR=  \
min      27.44   23.01   17.66   108.61   87.70  102.20  1022.58   9.14  1.04
mean     93.46   44.56   29.36   480.46  170.22  180.32  1802.71  17.03  1.25
std      40.55   19.53    8.17   372.31   42.48   48.19   483.34   5.88  0.11
median   90.58   39.66   27.33   322.06  164.43  186.32  1863.08  15.58  1.27
max     193.98  102.49   57.08  1750.08  273.38  286.58  2872.87  48.00  1.48

           XAU=    GDX     GLD
min     1051.36  12.47  100.50
mean    1349.01  33.57  130.09
std      188.75  15.17   18.78
median  1292.61  25.62  124.00
max     1898.99  66.63  184.59

(1) The mean value per column.
(2) The minimum value per column.
(3) The mean value per column.
(4) The standard deviation per column.
(5) The median per column.
(6) The maximum value per column.


Using the aggregate() method also allows one to pass custom functions. 
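
As an illustration of passing a custom function, here is a minimal sketch. The function `value_range` and the small two-column DataFrame are hypothetical and only serve to show the mechanics:

```python
import pandas as pd


def value_range(col):
    """Hypothetical custom statistic: spread between maximum and minimum."""
    return col.max() - col.min()


# Small made-up data set (assumption, not the EOD data).
df = pd.DataFrame({'A': [1.0, 2.0, 4.0],
                   'B': [10.0, 10.5, 9.5]})

# Built-in aggregations by name plus the custom function;
# the function name becomes the row label in the result.
stats = df.aggregate(['min', 'max', value_range])
print(stats)
```

The custom function receives one column (a Series object) at a time, so any expression that reduces a Series to a single number can be used.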


Changes over Time 


Statistical analysis methods are often based on changes over time and not the abso- 
lute values themselves. There are multiple options to calculate the changes in a time 
series over time, including absolute differences, percentage changes, and logarithmic 
(log) returns. 
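
The three measures can be compared side by side on a tiny example. This is a sketch with a made-up three-point price series (an assumption for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical price series: up 10%, then down 10%.
prices = pd.Series([100.0, 110.0, 99.0])

abs_diff = prices.diff()                    # p_t - p_{t-1}
pct = prices.pct_change()                   # p_t / p_{t-1} - 1
log_ret = np.log(prices / prices.shift(1))  # log(p_t / p_{t-1})

print(pd.DataFrame({'abs_diff': abs_diff, 'pct': pct, 'log_ret': log_ret}))
```

The first row is NaN for all three measures because there is no previous value to compare with. Log returns and simple returns are linked by log_ret = log(1 + pct).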


First, the absolute differences, for which pandas provides a special method: 


In [15]: data.diff().head()  (1)
Out[15]:
              AAPL.O  MSFT.O  INTC.O  AMZN.O  GS.N   SPY  .SPX  .VIX    EUR=  \
Date
2010-01-01       NaN     NaN     NaN     NaN   NaN   NaN   NaN   NaN     NaN
2010-01-04       NaN     NaN     NaN     NaN   NaN   NaN   NaN   NaN  0.0088
2010-01-05  0.052857   0.010   -0.01    0.79  3.06  0.30  3.53 -0.69 -0.0043
2010-01-06 -0.487142  -0.190   -0.07   -2.44 -1.88  0.08  0.62 -0.19  0.0044
2010-01-07 -0.055714  -0.318   -0.20   -2.25  3.41  0.48  4.55 -0.10 -0.0094

             XAU=   GDX   GLD
Date
2010-01-01    NaN   NaN   NaN
2010-01-04  23.65   NaN   NaN
2010-01-05  -1.35  0.46 -0.10
2010-01-06  19.85  1.17  1.81
2010-01-07  -6.60 -0.24 -0.69

In [16]: data.diff().mean()  (2)
Out[16]: AAPL.O    0.064737
         MSFT.O    0.031246
         INTC.O    0.013540
         AMZN.O    0.706608
         GS.N      0.028224
         SPY       0.072103
         .SPX      0.732659
         .VIX     -0.019583
         EUR=     -0.000119
         XAU=      0.041887
         GDX      -0.015071
         GLD      -0.003455
         dtype: float64

(1) diff() provides the absolute changes between two index values.
(2) Of course, aggregation operations can be applied in addition.


From a statistics point of view, absolute changes are not optimal because they are 
dependent on the scale of the time series data itself. Therefore, percentage changes 
are usually preferred. The following code derives the percentage changes or percent- 
age returns (also: simple returns) in a financial context and visualizes their mean val- 
ues per column (see Figure 8-2): 


In [17]: data.pct_change().round(3).head() (1) 


Out[17]: 
AAPL.O MSFT.O INTC.O AMZN.O GS.N SPY  .SPX  .VIX  EUR= 
Date 
2010-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN 
2010-01-04 NaN NaN NaN NaN NaN NaN NaN NaN 0.006 
2010-01-05 0.002 0.000 -0.000 0.006 0.018 0.003 0.003 -0.034 -0.003 
2010-01-06 -0.016 -0.006 -0.003 -0.018 -0.011 0.001 0.001 -0.010 0.003 
2010-01-07 -0.002 -0.010 -0.010 -0.017 0.020 0.004 0.004 -0.005 -0.007 
XAU= GDX GLD 
Date 
2010-01-01 NaN NaN NaN 
2010-01-04 0.022 NaN NaN 
2010-01-05 -0.001 0.010 -0.001 
2010-01-06 0.018 0.024 0.016 
2010-01-07 -0.006 -0.005 -0.006 

In [18]: data.pct_change().mean().plot(kind='bar', figsize=(10, 6));  (2)

(1) pct_change() calculates the percentage change between two index values.
(2) The mean values of the results are visualized as a bar plot.

Figure 8-2. Mean values of percentage changes as bar plot


As an alternative to percentage returns, log returns can be used. In some scenarios,
they are easier to handle and therefore often preferred in a financial context.²
Figure 8-3 shows the cumulative log returns for the individual financial time series. This
type of plot leads to some form of normalization:


In [19]: rets = np.log(data / data.shift(1))  (1)

In [20]: rets.head().round(3)  (2)
Out[20]:
            AAPL.O  MSFT.O  INTC.O  AMZN.O   GS.N    SPY   .SPX   .VIX   EUR=  \
Date
2010-01-01     NaN     NaN     NaN     NaN    NaN    NaN    NaN    NaN    NaN
2010-01-04     NaN     NaN     NaN     NaN    NaN    NaN    NaN    NaN  0.006
2010-01-05   0.002   0.000  -0.000   0.006  0.018  0.003  0.003 -0.035 -0.003
2010-01-06  -0.016  -0.006  -0.003  -0.018 -0.011  0.001  0.001 -0.010  0.003
2010-01-07  -0.002  -0.010  -0.010  -0.017  0.019  0.004  0.004 -0.005 -0.007

             XAU=    GDX    GLD
Date
2010-01-01    NaN    NaN    NaN
2010-01-04  0.021    NaN    NaN
2010-01-05 -0.001  0.010 -0.001
2010-01-06  0.018  0.024  0.016
2010-01-07 -0.006 -0.005 -0.006

In [21]: rets.cumsum().apply(np.exp).plot(figsize=(10, 6));  (3)

(1) Calculates the log returns in vectorized fashion.
(2) A subset of the results.
(3) Plots the cumulative log returns over time; first the cumsum() method is called,
    then np.exp() is applied to the results.

2 One of the advantages is additivity over time, which does not hold true for simple percentage changes/
returns.



Figure 8-3. Cumulative log returns over time
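
The additivity property, and the normalization effect of the cumulative plot, can be verified with a small numerical sketch. The four-point price series below is made up for illustration (an assumption, not data from the file):

```python
import numpy as np
import pandas as pd

# Hypothetical price path from 100 to 105.
prices = pd.Series([100.0, 110.0, 99.0, 105.0])

log_rets = np.log(prices / prices.shift(1)).dropna()
simple_rets = prices.pct_change().dropna()

# The true return over the whole period.
true_total = prices.iloc[-1] / prices.iloc[0] - 1

# Log returns add up: their sum is the log of the total gross return.
total_from_logs = np.exp(log_rets.sum()) - 1

# Simple returns do not add up to the total return.
total_from_simple_sum = simple_rets.sum()

# cumsum() followed by exp() recovers the price path normalized to 1.
normalized = np.exp(log_rets.cumsum())

print(true_total, total_from_logs, total_from_simple_sum)
```

This is why `rets.cumsum().apply(np.exp)` puts all instruments on a common scale starting at 1, regardless of their absolute price levels.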


Resampling 


Resampling is an important operation on financial time series data. Usually this takes
the form of downsampling, meaning that, for example, a tick data series is resampled
to one-minute intervals or a time series with daily observations is resampled to one
with weekly or monthly observations (as shown in Figure 8-4):


In [22]: data.resample('1w', label='right').last().head()  (1)
Out[22]:
               AAPL.O  MSFT.O  INTC.O  AMZN.O    GS.N     SPY     .SPX   .VIX  \
Date
2010-01-03        NaN     NaN     NaN     NaN     NaN     NaN      NaN    NaN
2010-01-10  30.282827   30.66   20.83  133.52  174.31  114.57  1144.98  18.13
2010-01-17  29.418542   30.86   20.80  127.14  165.21  113.64  1136.03  17.91
2010-01-24  28.249972   28.96   19.91  121.43  154.12  109.21  1091.76  27.31
2010-01-31  27.437544   28.18   19.40  125.41  148.72  107.39  1073.87  24.62

              EUR=     XAU=    GDX     GLD
Date
2010-01-03  1.4323  1096.35    NaN     NaN
2010-01-10  1.4412  1136.10  49.84  111.37
2010-01-17  1.4382  1129.90  47.42  110.86
2010-01-24  1.4137  1092.60  43.79  107.17
2010-01-31  1.3862  1081.05  40.72  105.96

In [23]: data.resample('1m', label='right').last().head()  (2)
Out[23]:
               AAPL.O   MSFT.O  INTC.O  AMZN.O    GS.N       SPY     .SPX  \
Date
2010-01-31  27.437544  28.1800   19.40  125.41  148.72  107.3900  1073.87
2010-02-28  29.231399  28.6700   20.53  118.40  156.35  110.7400  1104.49
2010-03-31  33.571395  29.2875   22.29  135.77  170.63  117.0000  1169.43
2010-04-30  37.298534  30.5350   22.84  137.10  145.20  118.8125  1186.69
2010-05-31  36.697106  25.8000   21.42  125.46  144.26  109.3690  1089.41

             .VIX    EUR=     XAU=    GDX      GLD
Date
2010-01-31  24.62  1.3862  1081.05  40.72  105.960
2010-02-28  19.50  1.3625  1116.10  43.89  109.430
2010-03-31  17.59  1.3510  1112.80  44.41  108.950
2010-04-30  22.05  1.3295  1178.25  50.51  115.360
2010-05-31  32.07  1.2305  1215.71  49.86  118.881

In [24]: rets.cumsum().apply(np.exp).resample('1m', label='right').last(
                                                  ).plot(figsize=(10, 6));  (3)

(1) EOD data gets resampled to weekly time intervals ...
(2) ... and monthly time intervals.
(3) This plots the cumulative log returns over time: first, the cumsum() method is
    called, then np.exp() is applied to the results; finally, the resampling takes place.

Figure 8-4. Resampled cumulative log returns over time (monthly)


Avoiding Foresight Bias 


When resampling, pandas in many cases uses by default the left
label (or index value) of the interval. To be financially consistent,
make sure to use the right label (index value) and in general the last
available data point in the interval. Otherwise, a foresight bias
might sneak into the financial analysis.³
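
The effect of the label choice can be made concrete with a small sketch. It assumes a hypothetical series of ten business days starting on Monday, 2010-01-04, whose values simply count the days:

```python
import pandas as pd

# Ten business days: Mon 2010-01-04 to Fri 2010-01-15 (made-up values 0..9).
idx = pd.date_range('2010-01-04', periods=10, freq='B')
s = pd.Series(range(10), index=idx, dtype=float)

# Same bins, same values; only the timestamp attached to each bin differs.
right = s.resample('W', label='right').last()
left = s.resample('W', label='left').last()

print(right)
print(left)
```

With `label='left'`, the value observed on Friday, 2010-01-08, carries the timestamp 2010-01-03, a date on which it was not yet known. An analysis that joins on these timestamps would implicitly look into the future. With `label='right'`, the same value is stamped 2010-01-10, after the observation.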


Rolling Statistics 


It is financial tradition to work with rolling statistics, often also called financial indi- 
cators or financial studies. Such rolling statistics are basic tools for financial chartists 
and technical traders, for example. This section works with a single financial time 
series only: 


In [25]: sym = 'AAPL.O'

In [26]: data = pd.DataFrame(data[sym]).dropna()

In [27]: data.tail()
Out[27]:             AAPL.O
         Date
         2018-06-25  182.17
         2018-06-26  184.43
         2018-06-27  184.16
         2018-06-28  185.50
         2018-06-29  185.11

3 Foresight bias—or, in its strongest form, perfect foresight—means that at some point in the financial analysis,
data is used that only becomes available at a later point. The result might be “too good” results, for example,
when backtesting a trading strategy.


An Overview 

It is straightforward to derive standard rolling statistics with pandas: 

In [28]: window = 20  (1)

In [29]: data['min'] = data[sym].rolling(window=window).min()  (2)

In [30]: data['mean'] = data[sym].rolling(window=window).mean()  (3)

In [31]: data['std'] = data[sym].rolling(window=window).std()  (4)

In [32]: data['median'] = data[sym].rolling(window=window).median()  (5)

In [33]: data['max'] = data[sym].rolling(window=window).max()  (6)

In [34]: data['ewma'] = data[sym].ewm(halflife=0.5, min_periods=window).mean()  (7)

(1) Defines the window; i.e., the number of index values to include.
(2) Calculates the rolling minimum value.
(3) Calculates the rolling mean value.
(4) Calculates the rolling standard deviation.
(5) Calculates the rolling median value.
(6) Calculates the rolling maximum value.
(7) Calculates the exponentially weighted moving average, with decay in terms of a
    half life of 0.5.


To derive more specialized financial indicators, additional packages are generally
needed (see, for instance, the financial plots with Cufflinks in “Interactive 2D
Plotting” on page 195). Custom ones can also easily be applied via the apply() method.
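
A custom rolling indicator via apply() could look as follows. This is a minimal sketch; the function `window_range` and the five-point series are hypothetical and chosen only to keep the arithmetic easy to follow:

```python
import pandas as pd


def window_range(values):
    """Hypothetical indicator: high minus low within the window."""
    return values.max() - values.min()


# Made-up price series (assumption).
s = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0])

# raw=True passes each window as a NumPy array, which is typically faster.
rng = s.rolling(window=3).apply(window_range, raw=True)
print(rng)
```

As with the built-in rolling statistics, the first `window - 1` entries are NaN because the window is not yet filled.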


The following code shows a subset of the results and visualizes a selection of the cal- 
culated rolling statistics (see Figure 8-5): 


In [35]: data.dropna().head()
Out[35]:
               AAPL.O        min       mean       std  ...       ewma
Date
2010-02-01  27.818544  27.437544  29.580892  0.933650  ...  27.805432
2010-02-02  27.979972  27.437544  29.451249  0.968048  ...  27.936337
2010-02-03  28.461400  27.437544  29.343035  0.950665  ...  28.330134
2010-02-04  27.435687  27.435687  29.207892  1.029129  ...  27.659299
2010-02-05  27.922829  27.435687  29.099892  1.037811  ...  27.856947

In [36]: ax = data[['min', 'mean', 'max']].iloc[-200:].plot(
             figsize=(10, 6), style=['g--', 'r--', 'g--'], lw=0.8)  (1)
         data[sym].iloc[-200:].plot(ax=ax, lw=2.0);  (2)

(1) Plots three rolling statistics for the final 200 data rows.
(2) Adds the original time series data to the plot.

Figure 8-5. Rolling statistics for minimum, mean, maximum values


A Technical Analysis Example


Rolling statistics are a major tool in the so-called technical analysis of stocks, as com- 
pared to the fundamental analysis which focuses, for instance, on financial reports 
and the strategic positions of the company whose stock is being analyzed. 


A decades-old trading strategy based on technical analysis is using two simple moving 
averages (SMAs). The idea is that the trader should go long on a stock (or financial 
instrument in general) when the shorter-term SMA is above the longer-term SMA 
and should go short when the opposite holds true. The concepts can be made precise 
with pandas and the capabilities of the DataFrame object. 


Rolling statistics are generally only calculated when there is enough data given the 
window parameter specification. As Figure 8-6 shows, the SMA time series only start 
at the day for which there is enough data given the specific parameterization: 

In [37]: data['SMA1'] = data[sym].rolling(window=42).mean()  (1)

In [38]: data['SMA2'] = data[sym].rolling(window=252).mean()  (2)

In [39]: data[[sym, 'SMA1', 'SMA2']].tail()
Out[39]:             AAPL.O        SMA1        SMA2
         Date
         2018-06-25  182.17  185.606190  168.265556
         2018-06-26  184.43  186.087381  168.418770
         2018-06-27  184.16  186.607381  168.579206
         2018-06-28  185.50  187.089286  168.736627
         2018-06-29  185.11  187.470476  168.901032

In [40]: data[[sym, 'SMA1', 'SMA2']].plot(figsize=(10, 6));  (3)

(1) Calculates the values for the shorter-term SMA.
(2) Calculates the values for the longer-term SMA.
(3) Visualizes the stock price data plus the two SMA time series.

Figure 8-6. Apple stock price and two simple moving averages


In this context, the SMAs are only a means to an end. They are used to derive posi- 
tions to implement a trading strategy. Figure 8-7 visualizes a long position by a value 
of 1 and a short position by a value of -1. The change in the position is triggered (vis- 
ually) by a crossover of the two lines representing the SMA time series: 


In [41]: data.dropna(inplace=True)  (1)

In [42]: data['positions'] = np.where(data['SMA1'] > data['SMA2'],  (2)
                                      1,  (3)
                                      -1)  (4)

In [43]: ax = data[[sym, 'SMA1', 'SMA2', 'positions']].plot(figsize=(10, 6),
                                                      secondary_y='positions')
         ax.get_legend().set_bbox_to_anchor((0.25, 0.85));

(1) Only complete data rows are kept.
(2) If the shorter-term SMA value is greater than the longer-term one ...
(3) ... go long on the stock (put a 1).
(4) Otherwise, go short on the stock (put a -1).

Figure 8-7. Apple stock price, two simple moving averages and positions


The trading strategy implicitly derived here only leads to a few trades per se: only 
when the position value changes (i.e., a crossover happens) does a trade take place. 
Including opening and closing trades, this would add up to just six trades in total. 
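
The number of position changes can be derived from the positions column itself. The following sketch assumes a hypothetical six-row DataFrame with hand-picked SMA values (not the Apple data), so the count it produces is not the one quoted above:

```python
import numpy as np
import pandas as pd

# Made-up SMA values that cross three times (assumption).
df = pd.DataFrame({'SMA1': [1.0, 2.0, 3.0, 2.0, 1.0, 2.0],
                   'SMA2': [2.0, 2.5, 2.5, 2.5, 2.5, 1.5]})

# Long (1) when the shorter-term SMA is above the longer-term one, else short (-1).
df['positions'] = np.where(df['SMA1'] > df['SMA2'], 1, -1)

# A crossover is any row where the position differs from the previous row.
changes = df['positions'].diff().fillna(0)
crossovers = int((changes != 0).sum())

# Counting the opening and the closing trade as well (a convention assumed here).
total_trades = crossovers + 2
print(crossovers, total_trades)
```

Applied to the real data, the same two lines (`diff()` and the comparison with zero) would identify the dates on which the strategy trades.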


Correlation Analysis 


As a further illustration of how to work with pandas and financial time series data,
consider the case of the S&P 500 stock index and the VIX volatility index. It is a
stylized fact that when the S&P 500 rises, the VIX falls in general, and vice versa. This is
about correlation and not causation. This section shows how to come up with some
supporting statistical evidence for the stylized fact that the S&P 500 and the VIX are
(highly) negatively correlated.⁴


The Data


The data set now consists of two financial time series, both visualized in Figure 8-8:


In [44]: raw = pd.read_csv('../../source/tr_eikon_eod_data.csv',
                           index_col=0, parse_dates=True)  (1)

In [45]: data = raw[['.SPX', '.VIX']].dropna()

In [46]: data.tail()
Out[46]:                .SPX   .VIX
         Date
         2018-06-25  2717.07  17.33
         2018-06-26  2723.06  15.92
         2018-06-27  2699.63  17.91
         2018-06-28  2716.31  16.85
         2018-06-29  2718.37  16.09

In [47]: data.plot(subplots=True, figsize=(10, 6));

(1) Reads the EOD data (originally from the Thomson Reuters Eikon Data API)
    from a CSV file.

4 One reason behind this is that when the stock index comes down—during a crisis, for instance—trading
volume goes up, and therewith also the volatility. When the stock index is on the rise, investors generally are
calm and do not see much incentive to engage in heavy trading. In particular, long-only investors then try to
ride the trend even further.

Figure 8-8. S&P 500 and VIX time series data (different subplots)


When plotting (parts of) the two time series in a single plot and with adjusted scal- 
ings, the stylized fact of negative correlation between the two indices becomes evident 
through simple visual inspection (Figure 8-9): 


In [48]: data.loc[:'2012-12-31'].plot(secondary_y='.VIX', figsize=(10, 6));  (1)

(1) .loc[:DATE] selects the data until the given value DATE.

Figure 8-9. S&P 500 and VIX time series data (same plot)


Logarithmic Returns 


As pointed out earlier, statistical analysis in general relies on returns instead of
absolute changes or even absolute values. Therefore, we'll calculate log returns first before
any further analysis takes place. Figure 8-10 shows the high variability of the log
returns over time. For both indices so-called “volatility clusters” can be spotted. In
general, periods of high volatility in the stock index are accompanied by the same
phenomena in the volatility index:


In [49]: rets = np.log(data / data.shift(1)) 


In [50]: rets.head()
Out[50]:                 .SPX      .VIX
         Date
         2010-01-04       NaN       NaN
         2010-01-05  0.003111 -0.035038
         2010-01-06  0.000545 -0.009868
         2010-01-07  0.003993 -0.005233
         2010-01-08  0.002878 -0.050024

In [51]: rets.dropna(inplace=True)

In [52]: rets.plot(subplots=True, figsize=(10, 6));

Figure 8-10. Log returns of the S&P 500 and VIX over time


In such a context, the pandas scatter_matrix() plotting function comes in handy 
for visualizations. It plots the log returns of the two series against each other, and one 
can add either a histogram or a kernel density estimator (KDE) on the diagonal (see 
Figure 8-11): 

In [53]: pd.plotting.scatter_matrix(rets,  (1)
                                    alpha=0.2,  (2)
                                    diagonal='hist',  (3)
                                    hist_kwds={'bins': 35},  (4)
                                    figsize=(10, 6));

(1) The data set to be plotted.
(2) The alpha parameter for the opacity of the dots.
(3) What to place on the diagonal; here: a histogram of the column data.
(4) Keywords to be passed to the histogram plotting function.

Figure 8-11. Log returns of the S&P 500 and VIX as a scatter matrix


OLS Regression 


With all these preparations, an ordinary least-squares (OLS) regression analysis is 
convenient to implement. Figure 8-12 shows a scatter plot of the log returns and the 
linear regression line through the cloud of dots. The slope is obviously negative, pro- 
viding support for the stylized fact about the negative correlation between the two 
indices: 


In [54]: reg = np.polyfit(rets['.SPX'], rets['.VIX'], deg=1)  (1)

In [55]: ax = rets.plot(kind='scatter', x='.SPX', y='.VIX', figsize=(10, 6))  (2)
         ax.plot(rets['.SPX'], np.polyval(reg, rets['.SPX']), 'r', lw=2);  (3)

(1) This implements a linear OLS regression.
(2) This plots the log returns as a scatter plot ...
(3) ... to which the linear regression line is added.


Figure 8-12. Log returns of the S&P 500 and VIX as a scatter matrix
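
The link between the regression slope and the correlation can be checked numerically: for a one-variable OLS regression, the slope equals the correlation coefficient times the ratio of the standard deviations. The sketch below uses synthetic returns (generated with an assumed true slope of -6.0 and a fixed seed), not the index data:

```python
import numpy as np
import pandas as pd

# Synthetic "returns" with a negative linear relationship (assumption).
rng = np.random.default_rng(100)
x = rng.standard_normal(500) * 0.01
y = -6.0 * x + rng.standard_normal(500) * 0.02
rets_demo = pd.DataFrame({'x': x, 'y': y})

# np.polyfit returns the coefficients from highest to lowest degree.
slope, intercept = np.polyfit(rets_demo['x'], rets_demo['y'], deg=1)

corr = rets_demo['x'].corr(rets_demo['y'])
implied_slope = corr * rets_demo['y'].std() / rets_demo['x'].std()

print(slope, corr, implied_slope)
```

A negative slope therefore always goes hand in hand with a negative correlation, but the slope alone says nothing about how tight the relationship is.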


Correlation 


Finally, we consider correlation measures directly. Two such measures are consid- 
ered: a static one taking into account the complete data set and a rolling one showing 
the correlation for a fixed window over time. Figure 8-13 illustrates that the correla- 
tion indeed varies over time but that it is always, given the parameterization, nega- 
tive. This provides strong support for the stylized fact that the S&P 500 and the VIX 
indices are (strongly) negatively correlated: 


In [56]: rets.corr()  (1)
Out[56]:           .SPX      .VIX
         .SPX  1.000000 -0.804382
         .VIX -0.804382  1.000000

In [57]: ax = rets['.SPX'].rolling(window=252).corr(
                           rets['.VIX']).plot(figsize=(10, 6))  (2)
         ax.axhline(rets.corr().iloc[0, 1], c='r');  (3)

(1) The correlation matrix for the whole DataFrame.
(2) This plots the rolling correlation over time ...
(3) ... and adds the static value to the plot as horizontal line.


Figure 8-13. Correlation between S&P 500 and VIX (static and rolling)
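
The difference between the static and the rolling measure can be reproduced without plotting. This sketch uses two synthetic series with an assumed true correlation of about -0.8 and a window of 60 instead of 252:

```python
import numpy as np
import pandas as pd

# Synthetic, negatively correlated series (assumption; not the index data).
rng = np.random.default_rng(42)
n = 300
x = pd.Series(rng.standard_normal(n))
y = -0.8 * x + 0.6 * pd.Series(rng.standard_normal(n))

static_corr = x.corr(y)                     # one number for the full sample
rolling_corr = x.rolling(window=60).corr(y)  # one number per window end

print(round(static_corr, 3))
print(rolling_corr.dropna().describe().round(3))
```

The rolling series starts only once the first window is complete, which is why the first `window - 1` values are NaN.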


High-Frequency Data 


This chapter is about financial time series analysis with pandas. Tick data sets are a
special case of financial time series. Frankly, they can be handled more or less in the
same ways as, for instance, the EOD data set used throughout this chapter so far.
Importing such data sets also is quite fast in general with pandas. The data set used
comprises 461,357 data rows (see also Figure 8-14):


In [59]: %%time
         # data from FXCM Forex Capital Markets Ltd.
         tick = pd.read_csv('../../source/fxcm_eur_usd_tick_data.csv',
                            index_col=0, parse_dates=True)
         CPU times: user 1.07 s, sys: 149 ms, total: 1.22 s
         Wall time: 1.16 s

In [60]: tick.info()
         <class 'pandas.core.frame.DataFrame'>
         DatetimeIndex: 461357 entries, 2018-06-29 00:00:00.082000 to 2018-06-29
          20:59:00.607000
         Data columns (total 2 columns):
         Bid    461357 non-null float64
         Ask    461357 non-null float64
         dtypes: float64(2)
         memory usage: 10.6 MB

In [61]: tick['Mid'] = tick.mean(axis=1)  (1)

In [62]: tick['Mid'].plot(figsize=(10, 6));

(1) Calculates the Mid price for every data row.

Figure 8-14. Tick data for EUR/USD exchange rate


Working with tick data is generally a scenario where resampling of financial time ser- 
ies data is needed. The code that follows resamples the tick data to five-minute bar 
data (see Figure 8-15), which can then be used, for example, to backtest algorithmic 
trading strategies or to implement a technical analysis: 


In [63]: tick_resam = tick.resample(rule='5min', label='right').last()

In [64]: tick_resam.head()
Out[64]:                          Bid      Ask       Mid
         2018-06-29 00:05:00  1.15649  1.15651  1.156500
         2018-06-29 00:10:00  1.15671  1.15672  1.156715
         2018-06-29 00:15:00  1.15725  1.15727  1.157260
         2018-06-29 00:20:00  1.15720  1.15722  1.157210
         2018-06-29 00:25:00  1.15711  1.15712  1.157115

In [65]: tick_resam['Mid'].plot(figsize=(10, 6));

Figure 8-15. Five-minute bar data for EUR/USD exchange rate
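
Bar data in the technical-analysis sense usually carries open, high, low, and close values rather than only the last price. As a variation on the resampling above, the following sketch builds such bars with the ohlc() method; the five ticks are made up (an assumption) and merely mimic the layout of the tick data set:

```python
import pandas as pd

# Hypothetical ticks within the first ten minutes of the day.
idx = pd.to_datetime(['2018-06-29 00:00:01', '2018-06-29 00:02:30',
                      '2018-06-29 00:04:59', '2018-06-29 00:06:00',
                      '2018-06-29 00:09:10'])
tick_demo = pd.DataFrame({'Bid': [1.1564, 1.1566, 1.1565, 1.1567, 1.1568],
                          'Ask': [1.1566, 1.1568, 1.1567, 1.1569, 1.1570]},
                         index=idx)
tick_demo['Mid'] = tick_demo[['Bid', 'Ask']].mean(axis=1)

# Five-minute open/high/low/close bars, stamped with the end of each interval.
bars = tick_demo['Mid'].resample('5min', label='right').ohlc()
print(bars)
```

The `label='right'` argument again makes sure that each bar carries the time at which all of its information was available.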


Conclusion 


This chapter deals with financial time series, probably the most important data type 
in the financial field. pandas is a powerful package to deal with such data sets, allow- 
ing not only for efficient data analyses but also easy visualizations, for instance. 
pandas is also helpful in reading such data sets from different sources as well as in 
exporting the data sets to different technical file formats. This is illustrated in the sub- 
sequent chapter. 


Further Resources 


Good references in book form for the topics covered in this chapter are:

• McKinney, Wes (2017). Python for Data Analysis. Sebastopol, CA: O’Reilly.
• VanderPlas, Jake (2016). Python Data Science Handbook. Sebastopol, CA: O’Reilly.


CHAPTER 9
Input/Output Operations


It is a capital mistake to theorize before one has data. 


—Sherlock Holmes 


As a general rule, the majority of data, be it in a finance context or any other applica- 
tion area, is stored on hard disk drives (HDDs) or some other form of permanent 
storage device, like solid state disks (SSDs) or hybrid disk drives. Storage capacities 
have been steadily increasing over the years, while costs per storage unit (e.g., per 
megabyte) have been steadily falling. 


At the same time, stored data volumes have been increasing at a much faster pace 
than the typical random access memory (RAM) available even in the largest 
machines. This makes it necessary not only to store data to disk for permanent stor- 
age, but also to compensate for lack of sufficient RAM by swapping data from RAM 
to disk and back. 


Input/output (I/O) operations are therefore important tasks when it comes to finance
applications and data-intensive applications in general. Often they represent the
bottleneck for performance-critical computations, since I/O operations cannot typically
shuffle data fast enough to the RAM¹ and from the RAM to the disk. In a sense, CPUs
are often “starving” due to slow I/O operations.


Although the majority of today’s financial and corporate analytics efforts are
confronted with big data (e.g., of petascale size), single analytics tasks generally use data
subsets that fall in the “mid” data category. A study by Microsoft Research concludes:


Our measurements as well as other recent work shows that the majority of real-world
analytic jobs process less than 100 GB of input, but popular infrastructures such as
Hadoop/MapReduce were originally designed for petascale processing.


—Appuswamy et al. (2013)


1 Here, no distinction is made between different levels of RAM and processor caches. The optimal use of
current memory architectures is a topic in itself.


In terms of frequency, single financial analytics tasks generally process data of not 
more than a couple of gigabytes (GB) in size—and this is a sweet spot for Python and 
the libraries of its scientific stack, such as NumPy, pandas, and PyTables. Data sets of 
such a size can also be analyzed in-memory, leading to generally high speeds with 
today’s CPUs and GPUs. However, the data has to be read into RAM and the results 
have to be written to disk, meanwhile ensuring that today’s performance require- 
ments are met. 


This chapter addresses the following topics: 


“Basic I/O with Python” on page 232 
Python has built-in functions to serialize and store any object on disk and to read 
it from disk into RAM; apart from that, Python is strong when it comes to work- 
ing with text files and SQL databases. NumPy also provides dedicated functions for 
fast binary storage and retrieval of ndarray objects. 


“I/O with pandas” on page 244 
The pandas library provides a plenitude of convenience functions and methods 
to read data stored in different formats (e.g., CSV, JSON) and to write data to 
files in diverse formats. 


“I/O with PyTables” on page 252 
PyTables uses the HDF5 standard with hierarchical database structure and 
binary storage to accomplish fast I/O operations for large data sets; speed often is 
only bound by the hardware used. 


“I/O with TsTables” on page 267 
TsTables is a package that builds on top of PyTables and allows for fast storage 
and retrieval of time series data. 


Basic I/O with Python


Python itself comes with a multitude of I/O capabilities, some optimized for perfor- 
mance, others more for flexibility. In general, however, they are easily used in inter- 
active as well as in production settings. 


Writing Objects to Disk 


For later use, for documentation, or for sharing with others, one might want to store
Python objects on disk. One option is to use the pickle module. This module can
serialize the majority of Python objects. Serialization refers to the conversion of an
object (hierarchy) to a byte stream; deserialization is the opposite operation.
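
A complete round trip can be sketched in a few lines. The example below is a minimal illustration that uses a temporary directory and `with` blocks (both choices are assumptions made here so that the files are closed automatically and no fixed path is needed):

```python
import os
import pickle
import tempfile
from random import gauss, seed

seed(100)  # fixed seed so that the example is reproducible
a = [gauss(1.5, 2) for _ in range(1000)]

# Temporary location for the demo file (assumption; any writable path works).
path = os.path.join(tempfile.mkdtemp(), 'data.pkl')

with open(path, 'wb') as f:  # serialization: object -> byte stream on disk
    pickle.dump(a, f)

with open(path, 'rb') as f:  # deserialization: byte stream -> object
    b = pickle.load(f)

same = (a == b)
print(same, os.path.getsize(path))
```

Note that pickle files should only be loaded from trusted sources, since deserialization can execute arbitrary code.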


As usual, some imports and customizations with regard to plotting first:

In [1]: from pylab import plt, mpl
        plt.style.use('seaborn')  # named 'seaborn-v0_8' in Matplotlib 3.6 and later
        mpl.rcParams['font.family'] = 'serif'
        %matplotlib inline


The example that follows works with (pseudo-)random data, this time stored in a 
list object: 


In [2]: import pickle  # (1)
        import numpy as np
        from random import gauss  # (2)

In [3]: a = [gauss(1.5, 2) for i in range(1000000)]  # (3)

In [4]: path = '/Users/yves/Temp/data/'  # (4)

In [5]: pkl_file = open(path + 'data.pkl', 'wb')  # (5)

(1) Imports the pickle module from the standard library.
(2) Imports gauss to generate normally distributed random numbers.
(3) Creates a larger list object with random numbers.
(4) Specifies the path where to store the data files.
(5) Opens a file for writing in binary mode (wb).


The two major functions to serialize and deserialize Python objects are 
pickle.dump(), for writing objects, and pickle.load(), for loading them into 
memory: 

In [6]: %time pickle.dump(a, pkl_file)  # (1)
        CPU times: user 37.2 ms, sys: 15.3 ms, total: 52.5 ms
        Wall time: 50.8 ms

In [7]: pkl_file.close()  # (2)

In [8]: ll $path*  # (3)
        -rw-r--r-- 1 yves staff 9002006 Oct 19 12:11 /Users/yves/Temp/data/data.pkl

In [9]: pkl_file = open(path + 'data.pkl', 'rb')  # (4)

In [10]: %time b = pickle.load(pkl_file)  # (5)
         CPU times: user 34.1 ms, sys: 16.7 ms, total: 50.8 ms
         Wall time: 48.7 ms

In [11]: a[:3]
Out[11]: [6.517874180585469, -0.5552400459507827, 2.8488946310833096]

In [12]: b[:3]
Out[12]: [6.517874180585469, -0.5552400459507827, 2.8488946310833096]

In [13]: np.allclose(np.array(a), np.array(b))  # (6)
Out[13]: True

(1) Serializes the object a and saves it to the file.
(2) Closes the file.
(3) Shows the file on disk and its size (Mac/Linux).
(4) Opens the file for reading in binary mode (rb).
(5) Reads the object from disk and deserializes it.
(6) Converting a and b to ndarray objects, np.allclose() verifies that both
    contain the same data (numbers).


Storing and retrieving a single object with pickle obviously is quite simple. What 
about two objects? 


In [14]: pkl_file = open(path + 'data.pkl', 'wb')

In [15]: %time pickle.dump(np.array(a), pkl_file)  # (1)
         CPU times: user 58.1 ms, sys: 6.09 ms, total: 64.2 ms
         Wall time: 32.5 ms

In [16]: %time pickle.dump(np.array(a) ** 2, pkl_file)  # (2)
         CPU times: user 66.7 ms, sys: 7.22 ms, total: 73.9 ms
         Wall time: 39.3 ms

In [17]: pkl_file.close()

In [18]: ll $path*  # (3)
         -rw-r--r-- 1 yves staff 16000322 Oct 19 12:11 /Users/yves/Temp/data/data.pkl

(1) Serializes the ndarray version of a and saves it.
(2) Serializes the squared ndarray version of a and saves it.
(3) The file now has roughly double the size from before.


234 | Chapter 9: Input/Output Operations 


What about reading the two ndarray objects back into memory? 


In [19]: pkl_file = open(path + 'data.pkl', 'rb')

In [20]: x = pickle.load(pkl_file)  # (1)
         x[:4]
Out[20]: array([ 6.51787418, -0.55524005,  2.84889463,  5.94489175])

In [21]: y = pickle.load(pkl_file)  # (2)
         y[:4]
Out[21]: array([42.48268383,  0.30829151,  8.11620062, 35.34173791])

In [22]: pkl_file.close()

(1) This retrieves the object that was stored first.
(2) This retrieves the object that was stored second.


Obviously, pickle stores objects according to the first in, first out (FIFO) principle. 
There is one major problem with this: there is no metainformation available to the 
user to know beforehand what is stored in a pickle file. 
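The FIFO behavior, and the lack of any table of contents, can be sketched with an in-memory buffer standing in for the file on disk. The use of io.BytesIO and the sample objects are assumptions made for illustration; the only way to find out how many objects a stream holds is to keep loading until it is exhausted.

```python
import io
import pickle

buffer = io.BytesIO()                # in-memory stand-in for a file opened with 'wb'
pickle.dump([1.5, 2.5], buffer)      # stored first
pickle.dump({'x': 1}, buffer)        # stored second

buffer.seek(0)                       # rewind, as if reopening the file with 'rb'
first = pickle.load(buffer)          # comes back first (FIFO)
second = pickle.load(buffer)         # comes back second

# There is no metainformation: a further load simply fails at the end of the stream.
try:
    pickle.load(buffer)
    exhausted = False
except EOFError:
    exhausted = True
```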


A sometimes helpful workaround is to not store single objects, but a dict object con- 
taining all the other objects: 


In [23]: pkl_file = open(path + 'data.pkl', 'wb')
         pickle.dump({'x': x, 'y': y}, pkl_file)  # (1)
         pkl_file.close()

In [24]: pkl_file = open(path + 'data.pkl', 'rb')
         data = pickle.load(pkl_file)  # (2)
         pkl_file.close()
         for key in data.keys():
             print(key, data[key][:4])
         x [ 6.51787418 -0.55524005  2.84889463  5.94489175]
         y [42.48268383  0.30829151  8.11620062 35.34173791]

In [25]: !rm -f $path*

(1) Stores a dict object containing the two ndarray objects.
(2) Retrieves the dict object.


This approach requires writing and reading all the objects at once, but this is a com- 
promise one can probably live with in many circumstances given the higher conve- 
nience it brings along. 
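A compact variant of the dict approach is sketched below. It uses with statements so that the files are closed even if an error occurs, and a temporary directory so that nothing is left on disk; both choices, as well as the sample data, are illustrative assumptions rather than part of the original example.

```python
import os
import pickle
import tempfile

# Small stand-ins for the two data sets stored under descriptive keys.
x = [1.0, 2.0, 3.0]
y = [v ** 2 for v in x]

with tempfile.TemporaryDirectory() as tmp:
    fn = os.path.join(tmp, 'data.pkl')
    with open(fn, 'wb') as f:          # file is closed automatically
        pickle.dump({'x': x, 'y': y}, f)
    with open(fn, 'rb') as f:
        data = pickle.load(f)

# The keys document what the file contained.
for key in data:
    print(key, data[key])
```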
Compatibility Issues

The use of pickle for the serialization of objects is generally
straightforward. However, it might lead to problems when, e.g., a
Python package is upgraded and the new version of the package
can no longer work with the object serialized by the older version.
It might also lead to problems when sharing such an object
across platforms and operating systems. It is therefore in general
advisable to work with the built-in reading and writing capabilities
of packages such as NumPy and pandas that are discussed in the
following sections.


Reading and Writing Text Files 


Text processing can be considered a strength of Python. In fact, many corporate and
scientific users use Python for exactly this task. With Python one has multiple
options to work with str objects, as well as with text files in general.

Assume the case of quite a large set of data that shall be shared as a CSV file.
Although such files have a special internal structure, they are basically plain text files.
The following code creates a dummy data set as an ndarray object, creates a
DatetimeIndex object, combines the two, and stores the data as a CSV text file:


In [26]: import pandas as pd

In [27]: rows = 5000  # (1)
         a = np.random.standard_normal((rows, 5)).round(4)  # (2)

In [28]: a
Out[28]: array([[-0.0892, -1.0508, -0.5942,  0.3367,  1.508 ],
                [ 2.1046,  3.2623,  0.704 , -0.2651,  0.4461],
                [-0.0482, -0.9221,  0.1332,  0.1192,  0.7782],
                ...,
                [ 0.3026, -0.2005, -0.9947,  1.0203, -0.6578],
                [-0.7031, -0.6989, -0.8031, -0.4271,  1.9963],
                [ 2.4573,  2.24054,  0.158 , -0.7039, -1.0337]])

In [29]: t = pd.date_range(start='2019/1/1', periods=rows, freq='H')  # (3)

In [30]: t
Out[30]: DatetimeIndex(['2019-01-01 00:00:00', '2019-01-01 01:00:00',
                        '2019-01-01 02:00:00', '2019-01-01 03:00:00',
                        '2019-01-01 04:00:00', '2019-01-01 05:00:00',
                        '2019-01-01 06:00:00', '2019-01-01 07:00:00',
                        '2019-01-01 08:00:00', '2019-01-01 09:00:00',
                        ...
                        '2019-07-27 22:00:00', '2019-07-27 23:00:00',
                        '2019-07-28 00:00:00', '2019-07-28 01:00:00',
                        '2019-07-28 02:00:00', '2019-07-28 03:00:00',
                        '2019-07-28 04:00:00', '2019-07-28 05:00:00',
                        '2019-07-28 06:00:00', '2019-07-28 07:00:00'],
                       dtype='datetime64[ns]', length=5000, freq='H')

In [31]: csv_file = open(path + 'data.csv', 'w')  # (4)

In [32]: header = 'date,no1,no2,no3,no4,no5\n'  # (5)

In [33]: csv_file.write(header)  # (5)
Out[33]: 25

In [34]: for t_, (no1, no2, no3, no4, no5) in zip(t, a):  # (6)
             s = '{},{},{},{},{},{}\n'.format(t_, no1, no2, no3, no4, no5)  # (7)
             csv_file.write(s)  # (8)

In [35]: csv_file.close()

In [36]: ll $path*
         -rw-r--r-- 1 yves staff 284757 Oct 19 12:11 /Users/yves/Temp/data/data.csv

(1) Defines the number of rows for the data set.
(2) Creates the ndarray object with the random numbers.
(3) Creates a DatetimeIndex object of appropriate length (hourly intervals).
(4) Opens a file for writing (w).
(5) Defines the header row (column labels) and writes it as the first line.
(6) Combines the data row-wise ...
(7) ... into str objects ...
(8) ... and writes it line-by-line (appending to the CSV text file).


The other way around works quite similarly. First, open the now-existing CSV file. 
Second, read its content line-by-line using the .readline() or .readlines() meth- 
ods of the file object: 


In [37]: csv_file = open(path + 'data.csv', 'r')  # (1)

In [38]: for i in range(5):
             print(csv_file.readline(), end='')  # (2)
         date,no1,no2,no3,no4,no5
         2019-01-01 00:00:00,-0.0892,-1.0508,-0.5942,0.3367,1.508
         2019-01-01 01:00:00,2.1046,3.2623,0.704,-0.2651,0.4461
         2019-01-01 02:00:00,-0.0482,-0.9221,0.1332,0.1192,0.7782
         2019-01-01 03:00:00,-0.359,-2.4955,0.6164,0.712,-1.4328

In [39]: csv_file.close()

In [40]: csv_file = open(path + 'data.csv', 'r')  # (1)

In [41]: content = csv_file.readlines()  # (3)

In [42]: content[:5]  # (4)
Out[42]: ['date,no1,no2,no3,no4,no5\n',
          '2019-01-01 00:00:00,-0.0892,-1.0508,-0.5942,0.3367,1.508\n',
          '2019-01-01 01:00:00,2.1046,3.2623,0.704,-0.2651,0.4461\n',
          '2019-01-01 02:00:00,-0.0482,-0.9221,0.1332,0.1192,0.7782\n',
          '2019-01-01 03:00:00,-0.359,-2.4955,0.6164,0.712,-1.4328\n']

In [43]: csv_file.close()

(1) Opens the file for reading (r).
(2) Reads the file contents line-by-line and prints them.
(3) Reads the file contents in a single step ...
(4) ... the result of which is a list object with all lines as separate str objects.
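Besides .readline() and .readlines(), a file object can be iterated over directly, which processes one line at a time without holding the whole file in memory. The sketch below assumes a tiny two-column sample held in an io.StringIO buffer as a stand-in for the CSV file on disk.

```python
import io

# Hypothetical sample in the same layout as the file written above (two columns only).
text = ('date,no1\n'
        '2019-01-01 00:00:00,-0.0892\n'
        '2019-01-01 01:00:00,2.1046\n')
f = io.StringIO(text)                  # behaves like a text file opened with 'r'

header = f.readline().rstrip('\n').split(',')   # first line: column labels
total = 0.0
for line in f:                         # iterates lazily over the remaining lines
    date, no1 = line.rstrip('\n').split(',')
    total += float(no1)                # values arrive as str and must be converted

print(header, round(total, 4))
```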


CSV files are so important and commonplace that there is a csv module in the
Python standard library that simplifies the processing of these files. Two helpful
reader (iterator) objects of the csv module return either a list of list objects or a
list of dict objects:


In [44]: import csv

In [45]: with open(path + 'data.csv', 'r') as f:
             csv_reader = csv.reader(f)  # (1)
             lines = [line for line in csv_reader]

In [46]: lines[:5]  # (1)
Out[46]: [['date', 'no1', 'no2', 'no3', 'no4', 'no5'],
          ['2019-01-01 00:00:00', '-0.0892', '-1.0508', '-0.5942', '0.3367',
           '1.508'],
          ['2019-01-01 01:00:00', '2.1046', '3.2623', '0.704', '-0.2651',
           '0.4461'],
          ['2019-01-01 02:00:00', '-0.0482', '-0.9221', '0.1332', '0.1192',
           '0.7782'],
          ['2019-01-01 03:00:00', '-0.359', '-2.4955', '0.6164', '0.712',
           '-1.4328']]

In [47]: with open(path + 'data.csv', 'r') as f:
             csv_reader = csv.DictReader(f)  # (2)
             lines = [line for line in csv_reader]

In [48]: lines[:3]  # (2)
Out[48]: [OrderedDict([('date', '2019-01-01 00:00:00'),
                       ('no1', '-0.0892'),
                       ('no2', '-1.0508'),
                       ('no3', '-0.5942'),
                       ('no4', '0.3367'),
                       ('no5', '1.508')]),
          OrderedDict([('date', '2019-01-01 01:00:00'),
                       ('no1', '2.1046'),
                       ('no2', '3.2623'),
                       ('no3', '0.704'),
                       ('no4', '-0.2651'),
                       ('no5', '0.4461')]),
          OrderedDict([('date', '2019-01-01 02:00:00'),
                       ('no1', '-0.0482'),
                       ('no2', '-0.9221'),
                       ('no3', '0.1332'),
                       ('no4', '0.1192'),
                       ('no5', '0.7782')])]

In [49]: !rm -f $path*

(1) csv.reader() returns every single line as a list object.
(2) csv.DictReader() returns every single line as an OrderedDict, which is a special
    case of a dict object.
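The csv module also covers the writing side. The following sketch, which assumes a small made-up sample and an in-memory io.StringIO buffer instead of a file on disk, writes rows with csv.writer() and reads them back with csv.DictReader(); the writer takes care of delimiters and quoting that the manual str.format() approach leaves to the user.

```python
import csv
import io

# Hypothetical sample rows: a timestamp string and two float values each.
rows = [('2019-01-01 00:00:00', -0.0892, -1.0508),
        ('2019-01-01 01:00:00', 2.1046, 3.2623)]

buffer = io.StringIO(newline='')       # stand-in for open(fn, 'w', newline='')
writer = csv.writer(buffer)
writer.writerow(['date', 'no1', 'no2'])  # header row
writer.writerows(rows)                   # all data rows in a single call

buffer.seek(0)                         # rewind before reading
records = list(csv.DictReader(buffer))   # one dict-like record per data row

print(records[0]['date'], records[0]['no1'])
```

As with the reader objects above, all values come back as str objects and have to be converted explicitly, for example with float().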


Working with SQL Databases 


Python can work with any kind of Structured Query Language (SQL) database, and 
in general also with any kind of NoSQL database. One SQL or relational database that 
is delivered with Python by default is SQLite3. With it, the basic Python approach to 
SQL databases can be easily illustrated: 


In [50]: import sqlite3 as sq3

In [51]: con = sq3.connect(path + 'numbs.db')  # (1)

In [52]: query = 'CREATE TABLE numbs (Date date, No1 real, No2 real)'  # (2)

In [53]: con.execute(query)  # (3)
Out[53]: <sqlite3.Cursor at 0x102655f10>

In [54]: con.commit()  # (4)

In [55]: q = con.execute  # (5)

In [56]: q('SELECT * FROM sqlite_master').fetchall()  # (6)
Out[56]: [('table',
           'numbs',
           'numbs',
           2,
           'CREATE TABLE numbs (Date date, No1 real, No2 real)')]

(1) Opens a database connection; a file is created if it does not exist.
(2) A SQL query that creates a table with three columns (see footnote 3).
(3) Executes the query ...
(4) ... and commits the changes.
(5) Defines a short alias for the con.execute() method.
(6) Fetches metainformation about the database, showing the just-created table as
    the single object.

2 For an overview of available database connectors for Python, visit https://wiki.python.org/moin/DatabaseInterfaces.
Instead of working directly with relational databases, object relational mappers such as SQLAlchemy
often prove useful. They introduce an abstraction layer that allows for more Pythonic, object-oriented code.
They also allow you to more easily exchange one relational database for another in the backend.


Now that there is a database file with a table, this table can be populated with data. 
Each row consists of a datetime object and two float objects: 


In [57]: import datetime

In [58]: now = datetime.datetime.now()
         q('INSERT INTO numbs VALUES(?, ?, ?)', (now, 0.12, 7.3))  # (1)
Out[58]: <sqlite3.Cursor at 0x102655f80>

In [59]: np.random.seed(100)

In [60]: data = np.random.standard_normal((10000, 2)).round(4)  # (2)

In [61]: %%time
         for row in data:  # (3)
             now = datetime.datetime.now()
             q('INSERT INTO numbs VALUES(?, ?, ?)', (now, row[0], row[1]))
         con.commit()
         CPU times: user 115 ms, sys: 6.69 ms, total: 121 ms
         Wall time: 124 ms

In [62]: q('SELECT * FROM numbs').fetchmany(4)  # (4)
Out[62]: [('2018-10-19 12:11:15.564019', 0.12, 7.3),
          ('2018-10-19 12:11:15.592956', -1.7498, 0.3427),
          ('2018-10-19 12:11:15.593033', 1.153, -0.2524),
          ('2018-10-19 12:11:15.593051', 0.9813, 0.5142)]

In [63]: q('SELECT * FROM numbs WHERE No1 > 0.5').fetchmany(4)  # (5)
Out[63]: [('2018-10-19 12:11:15.593033', 1.153, -0.2524),
          ('2018-10-19 12:11:15.593051', 0.9813, 0.5142),
          ('2018-10-19 12:11:15.593104', 0.6727, -0.1044),
          ('2018-10-19 12:11:15.593134', 1.619, 1.5416)]

In [64]: pointer = q('SELECT * FROM numbs')  # (6)

In [65]: for i in range(3):
             print(pointer.fetchone())  # (7)
         ('2018-10-19 12:11:15.564019', 0.12, 7.3)
         ('2018-10-19 12:11:15.592956', -1.7498, 0.3427)
         ('2018-10-19 12:11:15.593033', 1.153, -0.2524)

In [66]: rows = pointer.fetchall()  # (8)
         rows[:3]
Out[66]: [('2018-10-19 12:11:15.593051', 0.9813, 0.5142),
          ('2018-10-19 12:11:15.593063', 0.2212, -1.07),
          ('2018-10-19 12:11:15.593073', -0.1895, 0.255)]

(1) Writes a single row (or record) to the numbs table.
(2) Creates a larger dummy data set as an ndarray object.
(3) Iterates over the rows of the ndarray object.
(4) Retrieves a number of rows from the table.
(5) The same, but with a condition on the values in the No1 column.
(6) Defines a pointer object ...
(7) ... that behaves like a generator object.
(8) Retrieves all the remaining rows.

3 See https://www.sqlite.org/lang.html for an overview of the SQLite3 language dialect.
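The same workflow can be tried without touching the disk. The sketch below makes a few assumptions that differ from the session above: it uses an in-memory database (':memory:'), stores the timestamps as ISO-formatted strings, and inserts all rows with a single executemany() call inside a with block, which commits on success and rolls back on an error. Passing datetime objects directly, as above, relies on the default adapters of sqlite3, which are deprecated as of Python 3.12.

```python
import sqlite3

con = sqlite3.connect(':memory:')      # temporary database held in RAM
con.execute('CREATE TABLE numbs (Date text, No1 real, No2 real)')

# Hypothetical sample rows with timestamps stored as ISO-formatted strings.
rows = [('2018-10-19 12:11:15.564019', 0.12, 7.3),
        ('2018-10-19 12:11:15.592956', -1.7498, 0.3427),
        ('2018-10-19 12:11:15.593033', 1.153, -0.2524)]

with con:                              # commits on success, rolls back on error
    con.executemany('INSERT INTO numbs VALUES (?, ?, ?)', rows)

# Parameterized condition instead of a value pasted into the SQL string.
cursor = con.execute('SELECT No1, No2 FROM numbs WHERE No1 > ?', (0.5,))
first = cursor.fetchone()              # next matching row
rest = cursor.fetchall()               # all remaining matching rows
con.close()

print(first, rest)
```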


Finally, one might want to delete the table object in the database if it’s not required 
anymore: 


In [67]: q('DROP TABLE IF EXISTS numbs')  # (1)
Out[67]: <sqlite3.Cursor at 0x1187a7420>

In [68]: q('SELECT * FROM sqlite_master').fetchall()  # (2)
Out[68]: []

In [69]: con.close()  # (3)

In [70]: !rm -f $path*  # (4)

(1) Removes the table from the database.
(2) There are no table objects left after this operation.
(3) Closes the database connection.
(4) Removes the database file from disk.

SQL databases are a rather broad topic; indeed, too broad and complex to be covered
in any significant way in this chapter. The basic messages are:

• Python integrates well with almost any database technology.
• The basic SQL syntax is mainly determined by the database in use; the rest is
  what is called “Pythonic.”


A few more examples based on SQLite3 are included later in this chapter. 


Writing and Reading NumPy Arrays 


NumPy itself has functions to write and read ndarray objects in a convenient and per- 
formant fashion. This saves effort in some circumstances, such as when converting 
NumPy dtype objects into specific database data types (e.g., for SQLite3). To illustrate 
that NumPy can be an efficient replacement for a SQL-based approach, the following 
code replicates the example from the previous section with NumPy. 


Instead of pandas, the code uses the np.arange() function of NumPy to generate an 
ndarray object with datetime objects stored: 


In [71]: dtimes = np.arange('2019-01-01 10:00:00', '2025-12-31 22:00:00',
                            dtype='datetime64[m]')  # (1)

In [72]: len(dtimes)
Out[72]: 3681360

In [73]: dty = np.dtype([('Date', 'datetime64[m]'),
                         ('No1', 'f'), ('No2', 'f')])  # (2)

In [74]: data = np.zeros(len(dtimes), dtype=dty)  # (3)

In [75]: data['Date'] = dtimes  # (4)

In [76]: a = np.random.standard_normal((len(dtimes), 2)).round(4)  # (5)

In [77]: data['No1'] = a[:, 0]  # (6)
         data['No2'] = a[:, 1]  # (6)

In [78]: data.nbytes  # (7)
Out[78]: 58901760

(1) Creates an ndarray object with datetime as the dtype.
(2) Defines the special dtype object for the structured array.
(3) Instantiates an ndarray object with the special dtype.
(4) Populates the Date column.
(5) The dummy data sets ...
(6) ... which populate the No1 and No2 columns.
(7) The size of the structured array in bytes.


Saving of ndarray objects is highly optimized and therefore quite fast. Almost 60
MB of data takes a fraction of a second to save on disk (here using an SSD). A
larger ndarray object with 480 MB of data takes about half a second to save on disk
(see footnote 4):


In [79]: %time np.save(path + 'array', data)  # (1)
         CPU times: user 37.4 ms, sys: 58.9 ms, total: 96.4 ms
         Wall time: 77.9 ms

In [80]: ll $path*  # (2)
         -rw-r--r-- 1 yves staff 58901888 Oct 19 12:11 /Users/yves/Temp/data/array.npy

In [81]: %time np.load(path + 'array.npy')  # (3)
         CPU times: user 1.67 ms, sys: 44.8 ms, total: 46.5 ms
         Wall time: 44.6 ms
Out[81]: array([('2019-01-01T10:00',  1.5131,  0.6973),
                ('2019-01-01T10:01', -1.722 , -0.4815),
                ('2019-01-01T10:02',  0.8251,  0.3019), ...,
                ('2025-12-31T21:57',  1.372 ,  0.6446),
                ('2025-12-31T21:58', -1.2542,  0.1612),
                ('2025-12-31T21:59', -1.1997, -1.097 )],
               dtype=[('Date', '<M8[m]'), ('No1', '<f4'), ('No2', '<f4')])

In [82]: %time data = np.random.standard_normal((10000, 6000)).round(4)  # (4)
         CPU times: user 2.69 s, sys: 391 ms, total: 3.08 s
         Wall time: 2.78 s

In [83]: data.nbytes  # (4)
Out[83]: 480000000

In [84]: %time np.save(path + 'array', data)  # (4)
         CPU times: user 42.9 ms, sys: 300 ms, total: 343 ms
         Wall time: 481 ms

In [85]: ll $path*  # (4)
         -rw-r--r-- 1 yves staff 480000128 Oct 19 12:11 /Users/yves/Temp/data/array.npy

In [86]: %time np.load(path + 'array.npy')  # (4)
         CPU times: user 2.32 ms, sys: 363 ms, total: 365 ms
         Wall time: 363 ms
Out[86]: array([[ 0.3066,  0.5951,  0.5826, ...,  1.6773,  0.4294, -0.2216],
                [ 0.8769,  0.7292, -0.9557, ...,  0.5084,  0.9635, -0.4443],
                [-1.2202, -2.5509, -0.0575, ..., -1.6128,  0.4662, -1.3645],
                ...,
                [-0.5598,  0.2393, -2.3716, ...,  1.7669,  0.2462,  1.035 ],
                [ 0.273 ,  0.8216, -0.0749, ..., -0.0552, -0.8396,  0.3077],
                [-0.6305,  0.8334,  1.3702, ...,  0.3493,  0.1981,  0.2037]])

In [87]: !rm -f $path*

(1) This saves the structured ndarray object on disk.
(2) The size on disk is hardly larger than in memory (due to binary storage).
(3) This loads the structured ndarray object from disk.
(4) A larger regular ndarray object.

4 Note that such times might vary significantly even on the same machine when repeated multiple times,
because they depend, among other factors, on what the machine is doing CPU-wise and I/O-wise at the same
time.


These examples illustrate that writing to disk in this case is mainly hardware-bound, 
since the speeds observed represent roughly the advertised writing speed of standard 
SSDs at the time of this writing (about 500 MB/s). 
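As a quick plausibility check, the throughput implied by the two np.save() timings above can be computed directly from the byte counts and wall times shown in the session. The helper function is a hypothetical convenience, not part of NumPy.

```python
def throughput_mb_per_s(n_bytes, wall_seconds):
    """Return the implied transfer rate in MB/s (1 MB = 10**6 bytes)."""
    return n_bytes / 1e6 / wall_seconds

# Byte counts (data.nbytes) and wall times taken from the session above.
small = throughput_mb_per_s(58_901_760, 0.0779)   # structured array, 77.9 ms
large = throughput_mb_per_s(480_000_000, 0.481)   # regular array, 481 ms

print(round(small), round(large))
```

Both implied rates lie above the quoted 500 MB/s. One plausible explanation, stated here as an assumption, is that the operating system buffers writes in memory, so that the wall time measured does not cover the full physical write. Single timings of this kind are therefore best read as orders of magnitude.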


In any case, one can expect that this form of data storage and retrieval is faster when 
compared to SQL databases or using the pickle module for serialization. There are 
two reasons: first, the data is mainly numeric; second, NumPy uses binary storage, 
which reduces the overhead almost to zero. Of course, one does not have the func- 
tionality of a SQL database available with this approach, but PyTables will help in 
this regard, as subsequent sections show. 
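The near-zero overhead of the binary format can be checked on a small scale. The following sketch assumes a three-row structured array with the same dtype as above and writes it to an in-memory io.BytesIO buffer instead of a file; np.save() and np.load() accept file-like objects as well as paths.

```python
import io

import numpy as np

# Same structured dtype as in the example above, but with only three rows.
dty = np.dtype([('Date', 'datetime64[m]'), ('No1', 'f'), ('No2', 'f')])
data = np.zeros(3, dtype=dty)
data['Date'] = np.arange('2019-01-01T10:00', '2019-01-01T10:03',
                         dtype='datetime64[m]')
data['No1'] = [1.5, -1.75, 0.25]       # values exactly representable as float32

buffer = io.BytesIO()                  # in-memory stand-in for the .npy file
np.save(buffer, data)
overhead = len(buffer.getvalue()) - data.nbytes   # header bytes only

buffer.seek(0)
loaded = np.load(buffer)               # dtype and values are restored from the header

print(data.nbytes, overhead, loaded.dtype == data.dtype)
```

The overhead consists of the file header that describes dtype and shape; it does not grow with the number of rows, which is why the files on disk above are only 128 bytes larger than the arrays in memory.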


I/O with pandas 


One of the major strengths of pandas is that it can read and write different data for- 
mats natively, including: 


• CSV (comma-separated values)
• SQL (Structured Query Language)
• XLS/XLSX (Microsoft Excel files)
• JSON (JavaScript Object Notation)
• HTML (HyperText Markup Language)


Table 9-1 lists the supported formats and the corresponding import and export
functions/methods of pandas and the DataFrame class, respectively. The parameters that,
for example, the pd.read_csv() import function takes are described in the
documentation for pandas.read_csv.


Table 9-1. Import-export functions and methods

Format    Input                Output           Remark
CSV       pd.read_csv()        .to_csv()        Text file
XLS/XLSX  pd.read_excel()      .to_excel()      Spreadsheet
HDF       pd.read_hdf()        .to_hdf()        HDF5 database
SQL       pd.read_sql()        .to_sql()        SQL table
JSON      pd.read_json()       .to_json()       JavaScript Object Notation
MSGPACK   pd.read_msgpack()    .to_msgpack()    Portable binary format
HTML      pd.read_html()       .to_html()       HTML code
GBQ       pd.read_gbq()        .to_gbq()        Google BigQuery format
DTA       pd.read_stata()      .to_stata()      Formats 104, 105, 108, 113-115, 117
Any       pd.read_clipboard()  .to_clipboard()  E.g., from HTML page
Any       pd.read_pickle()     .to_pickle()     (Structured) Python object


The test case is again a larger set of float objects: 


In [88]: data = np.random.standard_normal((1000000, 5)).round(4) 


In [89]: data[:3] 

Out[89]: array([[ 0.4918, 1.3707, 0.137 , 0.3981, -1.0059], 
[ 0.4516, 1.4445, 0.0555, -0.0397, 0.44 ], 
[ 0.1629, -0.8473, -0.8223, -0.4621, -0.5137]]) 


To this end, this section also revisits SQLite3 and compares the performance to alter- 
native formats using pandas. 


Working with SQL Databases 
All that follows with regard to SQLite3 should be familiar by now: 


In [90]: filename = path + 'numbers'

In [91]: con = sq3.Connection(filename + '.db')

In [92]: query = 'CREATE TABLE numbers (No1 real, No2 real,\
                 No3 real, No4 real, No5 real)'  # (1)

In [93]: q = con.execute
         qm = con.executemany

In [94]: q(query)
Out[94]: <sqlite3.Cursor at 0x1187a76c0>

(1) Creates a table with five columns for real numbers (float objects).


This time, the .executemany() method can be applied since the data is available in a 
single ndarray object. Reading and working with the data works as before. Query 
results can also be visualized easily (see Figure 9-1): 


In [95]: %%time
         qm('INSERT INTO numbers VALUES (?, ?, ?, ?, ?)', data)  # (1)
         con.commit()
         CPU times: user 7.3 s, sys: 195 ms, total: 7.49 s
         Wall time: 7.71 s

In [96]: ll $path*
         -rw-r--r-- 1 yves staff 52633600 Oct 19 12:11 /Users/yves/Temp/data/numbers.db

In [97]: %%time
         temp = q('SELECT * FROM numbers').fetchall()  # (2)
         print(temp[:3])
         [(0.4918, 1.3707, 0.137, 0.3981, -1.0059), (0.4516, 1.4445, 0.0555,
         -0.0397, 0.44), (0.1629, -0.8473, -0.8223, -0.4621, -0.5137)]
         CPU times: user 1.7 s, sys: 124 ms, total: 1.82 s
         Wall time: 1.9 s

In [98]: %%time
         query = 'SELECT * FROM numbers WHERE No1 > 0 AND No2 < 0'
         res = np.array(q(query).fetchall()).round(3)  # (3)
         CPU times: user 639 ms, sys: 64.7 ms, total: 704 ms
         Wall time: 702 ms

In [99]: res = res[::100]  # (4)
         plt.figure(figsize=(10, 6))
         plt.plot(res[:, 0], res[:, 1], 'ro')  # (4)

(1) Inserts the whole data set into the table in a single step.
(2) Retrieves all the rows from the table in a single step.
(3) Retrieves a selection of the rows and transforms it to an ndarray object.
(4) Plots a subset of the query result.
Figure 9-1. Scatter plot of the query result (selection)


From SQL to pandas 


A generally more efficient approach, however, is the reading of either whole tables or 
query results with pandas. When one can read a whole table into memory, analytical 
queries can generally be executed much faster than when using the SQL disk-based 
approach (out-of-memory). 


Reading the whole table with pandas takes roughly the same amount of time as reading
it into a NumPy ndarray object. There as here, the bottleneck performance-wise is
the SQL database:


In [100]: %time data = pd.read_sql('SELECT * FROM numbers', con)  # (1)
          CPU times: user 2.17 s, sys: 180 ms, total: 2.35 s
          Wall time: 2.32 s

In [101]: data.head()
Out[101]:       No1     No2     No3     No4     No5
          0  0.4918  1.3707  0.1370  0.3981 -1.0059
          1  0.4516  1.4445  0.0555 -0.0397  0.4400
          2  0.1629 -0.8473 -0.8223 -0.4621 -0.5137
          3  1.3064  0.9125  0.5142 -0.7868 -0.3398
          4 -0.1148 -1.5215 -0.7045 -1.0042 -0.0600

(1) Reads all rows of the table into the DataFrame object named data.


The data is now in-memory, which allows for much faster analytics. The speedup is
often an order of magnitude or more. pandas can also master more complex queries,
although it is neither meant nor able to replace SQL databases when it comes to
complex relational data structures. The result of the query with multiple conditions
combined is shown in Figure 9-2:


In [102]: %time data[(data['No1'] > 0) & (data['No2'] < 0)].head()  # (1)
          CPU times: user 47.1 ms, sys: 12.3 ms, total: 59.4 ms
          Wall time: 33.4 ms
Out[102]:        No1     No2     No3     No4     No5
          2   0.1629 -0.8473 -0.8223 -0.4621 -0.5137
          5   0.1893 -0.0207 -0.2104  0.9419  0.2551
          8   1.4784 -0.3333 -0.7050  0.3586 -0.3937
          10  0.8092 -0.9899  1.0364 -1.0453  0.0579
          11  0.9065 -0.7757 -0.9267  0.7797  0.0863

In [103]: %%time
          q = '(No1 < -0.5 | No1 > 0.5) & (No2 < -1 | No2 > 1)'  # (2)
          res = data[['No1', 'No2']].query(q)
          CPU times: user 95.4 ms, sys: 22.4 ms, total: 118 ms
          Wall time: 56.4 ms

In [104]: plt.figure(figsize=(10, 6))
          plt.plot(res['No1'], res['No2'], 'ro');

(1) Two conditions combined logically.
(2) Four conditions combined logically.

Figure 9-2. Scatter plot of the query result (selection)
As expected, using the in-memory analytics capabilities of pandas leads to a
significant speedup, provided pandas is able to replicate the respective SQL statement.
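How a SQL-style condition translates into pandas can be sketched on a small random data set. The sample size and the use of np.random.default_rng() are assumptions for illustration; the point is that a Boolean mask and the .query() method express the same selection.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with the same column labels as in the text.
rng = np.random.default_rng(100)
data = pd.DataFrame(rng.standard_normal((1000, 2)).round(4),
                    columns=['No1', 'No2'])

# Variant 1: Boolean mask; each comparison needs its own parentheses.
mask = (((data['No1'] < -0.5) | (data['No1'] > 0.5)) &
        ((data['No2'] < -1) | (data['No2'] > 1)))
res_mask = data[mask]

# Variant 2: the same four conditions as a query string.
res_query = data.query('(No1 < -0.5 | No1 > 0.5) & (No2 < -1 | No2 > 1)')

print(len(res_mask), res_mask.equals(res_query))
```

Both variants are evaluated in memory on whole columns at once, which is where the speedup relative to the row-by-row retrieval from the database comes from.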


This is not the only advantage of using pandas, since pandas is tightly integrated with 
a number of other packages (including PyTables, the topic of the subsequent sec- 
tion). Here, it suffices to know that the combination of both can speed up I/O opera- 
tions considerably. This is shown in the following: 


In [105]: h5s = pd.HDFStore(filename + '.h5s', 'w')  # (1)

In [106]: %time h5s['data'] = data  # (2)
          CPU times: user 46.7 ms, sys: 47.1 ms, total: 93.8 ms
          Wall time: 99.7 ms

In [107]: h5s  # (3)
Out[107]: <class 'pandas.io.pytables.HDFStore'>
          File path: /Users/yves/Temp/data/numbers.h5s

In [108]: h5s.close()  # (4)

(1) This opens an HDF5 database file for writing; in pandas an HDFStore object is
    created.
(2) The complete DataFrame object is stored in the database file via binary storage.
(3) The HDFStore object information.
(4) The database file is closed.


The whole DataFrame with all the data from the original SQL table is written much 
faster when compared to the same procedure with SQLite3. Reading is even faster: 


In [109]: %%time
          h5s = pd.HDFStore(filename + '.h5s', 'r')  # (1)
          data_ = h5s['data']  # (2)
          h5s.close()  # (3)
          CPU times: user 11 ms, sys: 18.3 ms, total: 29.3 ms
          Wall time: 29.4 ms

In [110]: data_ is data  # (4)
Out[110]: False

In [111]: (data_ == data).all()  # (5)
Out[111]: No1    True
          No2    True
          No3    True
          No4    True
          No5    True
          dtype: bool

In [112]: np.allclose(data_, data)  # (5)
Out[112]: True

In [113]: ll $path*  # (6)
          -rw-r--r-- 1 yves staff 52633600 Oct 19 12:11 /Users/yves/Temp/data/numbers.db
          -rw-r--r-- 1 yves staff 48007240 Oct 19 12:11 /Users/yves/Temp/data/numbers.h5s

(1) This opens the HDF5 database file for reading.
(2) The DataFrame is read and stored in-memory as data_.
(3) The database file is closed.
(4) The two DataFrame objects are not the same ...
(5) ... but they now contain the same data.
(6) Binary storage generally comes with less size overhead compared to SQL tables,
    for instance.


Working with CSV Files 


One of the most widely used formats to exchange financial data is the CSV format. 
Although it is not really standardized, it can be processed by any platform and the 
vast majority of applications concerned with data and financial analytics. Earlier, we 
saw how to write and read data to and from CSV files with standard Python function- 
ality (see “Reading and Writing Text Files” on page 236). pandas makes this whole 
procedure a bit more convenient, the code more concise, and the execution in general 
faster (see also Figure 9-3): 


In [114]: %time data.to_csv(filename + '.csv')  # (1)
          CPU times: user 6.44 s, sys: 139 ms, total: 6.58 s
          Wall time: 6.71 s

In [115]: ll $path
          total 283672
          -rw-r--r-- 1 yves staff 43834157 Oct 19 12:11 numbers.csv
          -rw-r--r-- 1 yves staff 52633600 Oct 19 12:11 numbers.db
          -rw-r--r-- 1 yves staff 48007240 Oct 19 12:11 numbers.h5s

In [116]: %time df = pd.read_csv(filename + '.csv')  # (2)
          CPU times: user 1.12 s, sys: 111 ms, total: 1.23 s
          Wall time: 1.23 s

In [117]: df[['No1', 'No2', 'No3', 'No4']].hist(bins=20, figsize=(10, 6));

(1) The .to_csv() method writes the DataFrame data to disk in CSV format.
(2) The pd.read_csv() function then reads it back into memory as a new DataFrame
    object.
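One detail of such a round trip is easy to overlook: .to_csv() writes the index as the first column by default, and pd.read_csv() then treats it as regular data unless told otherwise. The sketch below assumes a tiny made-up DataFrame and an io.StringIO buffer in place of a file on disk.

```python
import io

import pandas as pd

# Hypothetical two-row sample; values chosen to be exactly representable as floats.
df = pd.DataFrame({'No1': [0.5, -1.25], 'No2': [2.0, 0.75]})

buffer = io.StringIO()                 # in-memory stand-in for the CSV file
df.to_csv(buffer)                      # the index is written as an unnamed first column

buffer.seek(0)
naive = pd.read_csv(buffer)            # index comes back as a data column
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)   # first column becomes the index again

print(list(naive.columns), list(restored.columns))
```

Alternatively, passing index=False to .to_csv() leaves the index out of the file altogether when it carries no information.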
Figure 9-3. Histograms for selected columns


Working with Excel Files 


The following code briefly demonstrates how pandas can write data in Excel format 
and read data from Excel spreadsheets. In this case, the data set is restricted to 
100,000 rows (see also Figure 9-4): 


In [118]: %time data[:100000].to_excel(filename + '.xlsx')  # (1)
          CPU times: user 25.9 s, sys: 520 ms, total: 26.4 s
          Wall time: 27.3 s

In [119]: %time df = pd.read_excel(filename + '.xlsx', 'Sheet1')  # (2)
          CPU times: user 5.78 s, sys: 70.1 ms, total: 5.85 s
          Wall time: 5.91 s

In [120]: df.cumsum().plot(figsize=(10, 6));

In [121]: ll $path*
          -rw-r--r-- 1 yves staff 43834157 Oct 19 12:11 /Users/yves/Temp/data/numbers.csv
          -rw-r--r-- 1 yves staff 52633600 Oct 19 12:11 /Users/yves/Temp/data/numbers.db
          -rw-r--r-- 1 yves staff 48007240 Oct 19 12:11 /Users/yves/Temp/data/numbers.h5s
          -rw-r--r-- 1 yves staff 4032725 Oct 19 12:12 /Users/yves/Temp/data/numbers.xlsx

In [122]: rm -f $path*

(1) The .to_excel() method writes the DataFrame data to disk in XLSX format.
(2) The pd.read_excel() function then reads it back into memory as a new
    DataFrame object, also specifying the sheet from which to read.


Generating the Excel spreadsheet file with a smaller subset of the data takes quite a 
while. This illustrates what kind of overhead the spreadsheet structure brings along 
with it. 

Inspection of the generated files shows that the DataFrame with HDFStore combination
is more compact than the SQLite3 database file (using compression, as described in the
next section, further increases the benefits). In the listing above, the CSV file is
slightly smaller still, because values rounded to four decimals need only a few
characters each as text; for float data at full precision, a text file is typically
larger than its binary counterpart. The slower performance when working with CSV
files stems mainly from the very fact that they are “only” general text files, so
that every value must be formatted on writing and parsed on reading.
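How the size of a text representation compares to binary storage depends strongly on the number of digits written. The following standard-library sketch, based on made-up random numbers and a simple comma-separated layout (both assumptions for illustration), compares 8-byte binary storage with text at full precision and text rounded to four decimals.

```python
import random
from array import array

random.seed(100)
values = [random.gauss(0, 1) for _ in range(10000)]   # hypothetical sample

# Binary storage: 8 bytes per double-precision float, independent of the value.
binary_size = len(array('d', values).tobytes())

# Text storage at full precision: repr() keeps all significant digits.
text_full = len(','.join(repr(v) for v in values).encode())

# Text storage after rounding to four decimals, as in the data sets above.
text_rounded = len(','.join(str(round(v, 4)) for v in values).encode())

print(binary_size, text_full, text_rounded)
```

At full precision the text form is considerably larger than the binary form, while rounding shrinks it substantially. This is why the size ranking of the files listed above should not be generalized to data with more significant digits.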
Figure 9-4. Line plots for all columns


I/O with PyTables


PyTables is a Python binding for the HDF5 database standard. It is specifically
designed to optimize the performance of I/O operations and make best use of the
available hardware. The library’s import name is tables. Similar to pandas, when it
comes to in-memory analytics PyTables is neither able nor meant to be a full
replacement for SQL databases. However, it brings along some features that further
close the gap. For example, a PyTables database can have many tables, and it supports
compression and indexing and also nontrivial queries on tables. In addition, it can
store NumPy arrays efficiently and has its own flavor of array-like data structures.


To begin with, some imports: 


In [123]: import tables as tb (1) 
import datetime as dt 


(1) The package name is PyTables, the import name is tables.


Working with Tables 


PyTables provides a file-based database format, similar to SQLite3.⁵ The following
opens a database file and creates a table: 


In [124]: filename = path + 'pytab.h5'

In [125]: h5 = tb.open_file(filename, 'w') (1)

In [126]: row_des = {
              'Date': tb.StringCol(26, pos=1), (2)
              'No1': tb.IntCol(pos=2), (3)
              'No2': tb.IntCol(pos=3), (3)
              'No3': tb.Float64Col(pos=4), (4)
              'No4': tb.Float64Col(pos=5) (4)
              }

In [127]: rows = 2000000

In [128]: filters = tb.Filters(complevel=0) (5)

In [129]: tab = h5.create_table('/', 'ints_floats', (6)
                                row_des, (7)
                                title='Integers and Floats', (8)
                                expectedrows=rows, (9)
                                filters=filters) (10)

In [130]: type(tab)
Out[130]: tables.table.Table

In [131]: tab
Out[131]: /ints_floats (Table(0,)) 'Integers and Floats'
            description := {
            "Date": StringCol(itemsize=26, shape=(), dflt=b'', pos=0),
            "No1": Int32Col(shape=(), dflt=0, pos=1),
            "No2": Int32Col(shape=(), dflt=0, pos=2),
            "No3": Float64Col(shape=(), dflt=0.0, pos=3),
            "No4": Float64Col(shape=(), dflt=0.0, pos=4)}
            byteorder := 'little'
            chunkshape := (2621,)

(1) Opens the database file in HDF5 binary storage format.
(2) The Date column for date-time information (as a str object).
(3) The two columns to store int objects.
(4) The two columns to store float objects.
(5) Via Filters objects, compression levels can be specified, among other things.
(6) The node (path) and technical name of the table.
(7) The description of the row data structure.
(8) The name (title) of the table.
(9) The expected number of rows; allows for optimizations.
(10) The Filters object to be used for the table.

5 Many other databases require a server/client architecture. For interactive
data and financial analytics, file-based databases prove a bit more convenient
and also sufficient for most purposes.


To populate the table with numerical data, two ndarray objects with random
numbers are generated: one with random integers, the other with random
floating-point numbers. The population of the table happens via a simple Python
loop:


In [132]: pointer = tab.row (1)

In [133]: ran_int = np.random.randint(0, 10000, size=(rows, 2)) (2)

In [134]: ran_flo = np.random.standard_normal((rows, 2)).round(4) (3)

In [135]: %%time
          for i in range(rows):
              pointer['Date'] = dt.datetime.now() (4)
              pointer['No1'] = ran_int[i, 0] (4)
              pointer['No2'] = ran_int[i, 1] (4)
              pointer['No3'] = ran_flo[i, 0] (4)
              pointer['No4'] = ran_flo[i, 1] (4)
              pointer.append() (5)
          tab.flush() (6)
          CPU times: user 8.16 s, sys: 78.7 ms, total: 8.24 s
          Wall time: 8.25 s

In [136]: tab (7)
Out[136]: /ints_floats (Table(2000000,)) 'Integers and Floats'
            description := {
            "Date": StringCol(itemsize=26, shape=(), dflt=b'', pos=0),
            "No1": Int32Col(shape=(), dflt=0, pos=1),
            "No2": Int32Col(shape=(), dflt=0, pos=2),
            "No3": Float64Col(shape=(), dflt=0.0, pos=3),
            "No4": Float64Col(shape=(), dflt=0.0, pos=4)}
            byteorder := 'little'
            chunkshape := (2621,)

In [137]: ll $path*
          -rw-r--r-- 1 yves staff 100156248 Oct 19 12:12 /Users/yves/Temp/data/pytab.h5

(1) A pointer object is created.
(2) The ndarray object with the random int objects is created.
(3) The ndarray object with the random float objects is created.
(4) The datetime object and the two int and two float objects are written
    row-by-row.
(5) The new row is appended.
(6) All written rows are flushed; i.e., committed as permanent changes.
(7) The changes are reflected in the Table object description.


The Python loop is quite slow in this case. There is a more performant and Pythonic 
way to accomplish the same result, by the use of NumPy structured arrays. Equipped 
with the complete data set stored in a structured array, the creation of the table boils 
down to a single line of code. Note that the row description is not needed anymore; 
PyTables uses the dtype object of the structured array to infer the data types instead: 


In [138]: dty = np.dtype([('Date', 'S26'), ('No1', '<i4'), ('No2', '<i4'),
                          ('No3', '<f8'), ('No4', '<f8')]) (1)

In [139]: sarray = np.zeros(len(ran_int), dtype=dty) (2)

In [140]: sarray[:4] (3)
Out[140]: array([(b'', 0, 0, 0., 0.), (b'', 0, 0, 0., 0.), (b'', 0, 0, 0., 0.),
                 (b'', 0, 0, 0., 0.)],
                dtype=[('Date', 'S26'), ('No1', '<i4'), ('No2', '<i4'), ('No3', '<f8'),
                ('No4', '<f8')])

In [141]: %%time
          sarray['Date'] = dt.datetime.now() (4)
          sarray['No1'] = ran_int[:, 0] (4)
          sarray['No2'] = ran_int[:, 1] (4)
          sarray['No3'] = ran_flo[:, 0] (4)
          sarray['No4'] = ran_flo[:, 1] (4)
          CPU times: user 161 ms, sys: 42.7 ms, total: 204 ms
          Wall time: 207 ms

In [142]: %%time
          h5.create_table('/', 'ints_floats_from_array', sarray,
                          title='Integers and Floats',
                          expectedrows=rows, filters=filters) (5)
          CPU times: user 42.9 ms, sys: 51.4 ms, total: 94.3 ms
          Wall time: 96.6 ms
Out[142]: /ints_floats_from_array (Table(2000000,)) 'Integers and Floats'
            description := {
            "Date": StringCol(itemsize=26, shape=(), dflt=b'', pos=0),
            "No1": Int32Col(shape=(), dflt=0, pos=1),
            "No2": Int32Col(shape=(), dflt=0, pos=2),
            "No3": Float64Col(shape=(), dflt=0.0, pos=3),
            "No4": Float64Col(shape=(), dflt=0.0, pos=4)}
            byteorder := 'little'
            chunkshape := (2621,)

(1) This defines the special dtype object.
(2) This creates the structured array with zeros (and empty strings).
(3) A few records from the ndarray object.
(4) The columns of the ndarray object are populated at once.
(5) This creates the Table object and populates it with the data.


This approach is more than an order of magnitude faster (roughly 0.3 seconds in
total compared with more than 8 seconds for the loop), has more concise code,
and achieves the same result:


In [143]: type(h5)
Out[143]: tables.file.File

In [144]: h5 (1)
Out[144]: File(filename=/Users/yves/Temp/data/pytab.h5, title='', mode='w',
           root_uep='/', filters=Filters(complevel=0, shuffle=False,
           bitshuffle=False, fletcher32=False, least_significant_digit=None))
          / (RootGroup) ''
          /ints_floats (Table(2000000,)) 'Integers and Floats'
            description := {
            "Date": StringCol(itemsize=26, shape=(), dflt=b'', pos=0),
            "No1": Int32Col(shape=(), dflt=0, pos=1),
            "No2": Int32Col(shape=(), dflt=0, pos=2),
            "No3": Float64Col(shape=(), dflt=0.0, pos=3),
            "No4": Float64Col(shape=(), dflt=0.0, pos=4)}
            byteorder := 'little'
            chunkshape := (2621,)
          /ints_floats_from_array (Table(2000000,)) 'Integers and Floats'
            description := {
            "Date": StringCol(itemsize=26, shape=(), dflt=b'', pos=0),
            "No1": Int32Col(shape=(), dflt=0, pos=1),
            "No2": Int32Col(shape=(), dflt=0, pos=2),
            "No3": Float64Col(shape=(), dflt=0.0, pos=3),
            "No4": Float64Col(shape=(), dflt=0.0, pos=4)}
            byteorder := 'little'
            chunkshape := (2621,)

In [145]: h5.remove_node('/', 'ints_floats_from_array') (2)

(1) The description of the File object with the two Table objects.
(2) This removes the second Table object with the redundant data.
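
The difference between the two population strategies can be reproduced on a small scale with NumPy alone. The following is a minimal sketch, not the book's benchmark: it assumes a reduced row count (n_rows, a name chosen here for illustration), a fixed seed, and a fixed timestamp string instead of dt.datetime.now(). It fills one structured array row by row and another column by column, then checks that both hold identical bytes.

```python
import datetime as dt
import numpy as np

n_rows = 1000  # hypothetical, small size for illustration only
rng = np.random.default_rng(100)
ran_int = rng.integers(0, 10000, size=(n_rows, 2))
ran_flo = rng.standard_normal((n_rows, 2)).round(4)

dty = np.dtype([('Date', 'S26'), ('No1', '<i4'), ('No2', '<i4'),
                ('No3', '<f8'), ('No4', '<f8')])
# fixed 26-character timestamp so that both arrays are comparable
stamp = str(dt.datetime(2018, 10, 19, 12, 12, 28, 227771)).encode()

# row-by-row population, analogous to the loop over the row pointer
by_row = np.zeros(n_rows, dtype=dty)
for i in range(n_rows):
    by_row[i] = (stamp, ran_int[i, 0], ran_int[i, 1],
                 ran_flo[i, 0], ran_flo[i, 1])

# column-wise population, as in the structured array approach
by_col = np.zeros(n_rows, dtype=dty)
by_col['Date'] = stamp
by_col['No1'] = ran_int[:, 0]
by_col['No2'] = ran_int[:, 1]
by_col['No3'] = ran_flo[:, 0]
by_col['No4'] = ran_flo[:, 1]

same = by_row.tobytes() == by_col.tobytes()
print(same)
```

Both variants end up with the same records; the column-wise version simply moves the loop from Python into NumPy, which is where the speed difference comes from.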


The Table object behaves much like a NumPy structured ndarray object in most
cases (see also Figure 9-5):


In [146]: tab[:3] (1)
Out[146]: array([(b'2018-10-19 12:12:28.227771', 8576, 5991, -0.0528, 0.2468),
                 (b'2018-10-19 12:12:28.227858', 2990, 9310, -0.0261, 0.3932),
                 (b'2018-10-19 12:12:28.227868', 4400, 4823, 0.9133, 0.2579)],
                dtype=[('Date', 'S26'), ('No1', '<i4'), ('No2', '<i4'), ('No3', '<f8'),
                ('No4', '<f8')])

In [147]: tab[:4]['No4'] (2)
Out[147]: array([ 0.2468, 0.3932, 0.2579, -0.5582])

In [148]: %time np.sum(tab[:]['No3']) (3)
          CPU times: user 76.7 ms, sys: 74.8 ms, total: 151 ms
          Wall time: 152 ms
Out[148]: 88.8542999999997

In [149]: %time np.sum(np.sqrt(tab[:]['No1'])) (3)
          CPU times: user 91 ms, sys: 57.9 ms, total: 149 ms
          Wall time: 164 ms
Out[149]: 133349920.3689251

In [150]: %%time
          plt.figure(figsize=(10, 6))
          plt.hist(tab[:]['No3'], bins=30); (4)
          CPU times: user 328 ms, sys: 72.1 ms, total: 400 ms
          Wall time: 456 ms

(1) Selecting rows via indexing.
(2) Selecting column values only via indexing.
(3) Applying NumPy universal functions.
(4) Plotting a column from the Table object.

Figure 9-5. Histogram of column data


PyTables also provides flexible tools to query data via typical SQL-like statements, as 
in the following example (the result of which is illustrated in Figure 9-6; compare it 
with Figure 9-2, based on a pandas query): 


In [151]: query = '((No3 < -0.5) | (No3 > 0.5)) & ((No4 < -1) | (No4 > 1))' (1)

In [152]: iterator = tab.where(query) (2)

In [153]: %time res = [(row['No3'], row['No4']) for row in iterator] (3)
          CPU times: user 269 ms, sys: 64.4 ms, total: 333 ms
          Wall time: 294 ms

In [154]: res = np.array(res) (4)
          res[:3]
Out[154]: array([[0.7694, 1.4866],
                 [0.9201, 1.3346],
                 [1.4701, 1.8776]])

In [155]: plt.figure(figsize=(10, 6))
          plt.plot(res.T[0], res.T[1], 'ro');

(1) The query as a str object, four conditions combined by logical operators.
(2) The iterator object based on the query.
(3) The rows resulting from the query are collected via a list comprehension ...
(4) ... and transformed to an ndarray object.


Figure 9-6. Scatter plot of column data 
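
For comparison, the same four conditions can be written as a boolean mask on an in-memory NumPy array. This is only a sketch under stated assumptions: the random data below is a stand-in for the No3 and No4 columns (smaller than the table in the text, with a fixed seed), and it says nothing about how PyTables evaluates the query string internally.

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in for the No3 and No4 columns (hypothetical sample size)
data = rng.standard_normal((100000, 2)).round(4)
no3, no4 = data[:, 0], data[:, 1]

# the same four conditions as in the PyTables query string
mask = ((no3 < -0.5) | (no3 > 0.5)) & ((no4 < -1) | (no4 > 1))
res = data[mask]

# every selected record satisfies both combined conditions
all_ok = bool((np.abs(res[:, 0]) > 0.5).all() and
              (np.abs(res[:, 1]) > 1).all())
print(res.shape, all_ok)
```

The operators | and & have the same meaning in both settings; the difference is that the PyTables version takes the condition as a string and iterates over the matching rows of the table on disk.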


Fast Queries 


Both pandas and PyTables are able to process relatively complex, 
SQL-like queries and selections. They are both optimized for speed 
when it comes to such operations. Although there are limits to 
these approaches compared to relational databases, for most 
numerical and financial applications these are often not relevant. 


As the following examples show, working with data stored in PyTables as Table 
objects gives the impression of working with NumPy or pandas objects in-memory, 
both from a syntax and a performance point of view: 

In [156]: %%time
          values = tab[:]['No3']
          print('Max %18.3f' % values.max())
          print('Ave %18.3f' % values.mean())
          print('Min %18.3f' % values.min())
          print('Std %18.3f' % values.std())
          Max              5.224
          Ave              0.000
          Min             -5.649
          Std              1.000
          CPU times: user 163 ms, sys: 70.4 ms, total: 233 ms
          Wall time: 234 ms

In [157]: %%time
          res = [(row['No1'], row['No2']) for row in
                  tab.where('((No1 > 9800) | (No1 < 200)) \
                          & ((No2 > 4500) & (No2 < 5500))')]
          CPU times: user 165 ms, sys: 52.5 ms, total: 218 ms
          Wall time: 155 ms

In [158]: for r in res[:4]:
              print(r)
          (91, 4870)
          (9803, 5026)
          (9846, 4859)
          (9823, 5069)

In [159]: %%time
          res = [(row['No1'], row['No2']) for row in
                  tab.where('(No1 == 1234) & (No2 > 9776)')]
          CPU times: user 58.9 ms, sys: 40.5 ms, total: 99.4 ms
          Wall time: 81 ms

In [160]: for r in res:
              print(r)
          (1234, 9841)
          (1234, 9821)
          (1234, 9867)
          (1234, 9987)
          (1234, 9849)
          (1234, 9800)


Working with Compressed Tables 


A major advantage of working with PyTables is the approach it takes to
compression. It uses compression not only to save space on disk, but also to
improve the performance of I/O operations in certain hardware scenarios. How
does this work? When I/O is the bottleneck and the CPU is able to (de)compress
data fast, the net effect of compression in terms of speed might be positive.
Since the following examples are based on the I/O of a standard SSD, there is
no speed advantage of compression to be observed. However, there is also almost
no disadvantage to using compression:


In [161]: filename = path + 'pytabc.h5'

In [162]: h5c = tb.open_file(filename, 'w')


In [163]: filters = tb.Filters(complevel=5, (1) 
complib='blosc') (2) 


In [164]: tabc = h5c.create_table('/', 'ints_floats', sarray, 
title='Integers and Floats', 
expectedrows=rows, filters=filters) 


In [165]: query = '((No3 < -0.5) | (No3 > 0.5)) & ((No4 < -1) | (No4 > 1))' 
In [166]: iteratorc = tabc.where(query) (3)


In [167]: %time res = [(row['No3'], row['No4']) for row in iteratorc] (4) 
CPU times: user 300 ms, sys: 50.8 ms, total: 351 ms 
Wall time: 311 ms 


In [168]: res = np.array(res) 
res[:3] 
Out[168]: array([[0.7694, 1.4866], 
[0.9201, 1.3346], 
[1.4701, 1.8776]]) 


(1) The complevel (compression level) parameter can take values between 0 (no
    compression) and 9 (highest compression).
(2) The Blosc compression engine is used, which is optimized for performance.
(3) This creates the iterator object, based on the query from before.
(4) The rows resulting from the query are collected via a list comprehension.


Generating the compressed Table object with the original data and doing analytics 
on it is slightly slower compared to the uncompressed Table object. What about 
reading the data into an ndarray object? Let’s check: 


In [169]: %time arr_non = tab.read() (1)
          CPU times: user 63 ms, sys: 78.5 ms, total: 142 ms
          Wall time: 149 ms

In [170]: tab.size_on_disk
Out[170]: 100122200

In [171]: arr_non.nbytes
Out[171]: 100000000

In [172]: %time arr_com = tabc.read() (2)
          CPU times: user 106 ms, sys: 55.5 ms, total: 161 ms
          Wall time: 173 ms

In [173]: tabc.size_on_disk
Out[173]: 41306140

In [174]: arr_com.nbytes
Out[174]: 100000000

In [175]: ll $path* (3)
          -rw-r--r-- 1 yves staff 200312336 Oct 19 12:12 /Users/yves/Temp/data/pytab.h5
          -rw-r--r-- 1 yves staff 41341436 Oct 19 12:12 /Users/yves/Temp/data/pytabc.h5

In [176]: h5c.close() (4)

(1) Reading from the uncompressed Table object tab.
(2) Reading from the compressed Table object tabc.
(3) Comparing the sizes; the size of the compressed table is significantly
    reduced.
(4) Closing the database file.
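
The in-memory size of 100,000,000 bytes reported for both arrays follows directly from the record layout. A short sketch of the arithmetic, reusing the dtype definition and the row count from the examples above and the compressed size_on_disk value from Out[173]:

```python
import numpy as np

dty = np.dtype([('Date', 'S26'), ('No1', '<i4'), ('No2', '<i4'),
                ('No3', '<f8'), ('No4', '<f8')])
rows = 2000000

bytes_per_row = dty.itemsize          # 26 + 4 + 4 + 8 + 8 bytes
total_bytes = bytes_per_row * rows    # size of the in-memory array
ratio = 41306140 / total_bytes        # compressed size on disk / raw size
print(bytes_per_row, total_bytes, round(ratio, 3))
# 50 100000000 0.413
```

In other words, with Blosc at compression level 5 the table occupies about 41% of its raw size on disk for this particular data set.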


The examples show that there is hardly any speed difference when working with 
compressed Table objects as compared to uncompressed ones. However, file sizes on 
disk might—depending on the quality of the data—be significantly reduced, which 
has a number of benefits: 


• Storage costs are reduced.
• Backup costs are reduced.
• Network traffic is reduced.
• Network speed is improved (storage on and retrieval from remote servers is
  faster).
• CPU utilization is increased to overcome I/O bottlenecks.
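
How much a file shrinks depends on the data being stored. The following rough sketch illustrates this with zlib from the Python standard library, which is used here as a stand-in for Blosc (it is a different compressor, so the numbers are not comparable to the PyTables results). The sample size, the seed, and the helper function ratio() are assumptions made for illustration.

```python
import zlib
import numpy as np

rng = np.random.default_rng(1)
n = 200000  # hypothetical sample size

ints = rng.integers(0, 10000, size=n).astype('<i4')  # small value range
floats = rng.standard_normal(n)                      # full-precision floats

def ratio(arr, level=5):
    """Compressed size divided by raw size for the bytes of an array."""
    raw = arr.tobytes()
    return len(zlib.compress(raw, level)) / len(raw)

r_ints = ratio(ints)
r_floats = ratio(floats)
print(round(r_ints, 2), round(r_floats, 2))
```

The integers use only a fraction of the 32 bits available per value and therefore compress well, while full-precision random floating-point numbers are close to incompressible.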


Working with Arrays 


“Basic I/O with Python” on page 232 showed that NumPy has built-in fast writing
and reading capabilities for ndarray objects. PyTables is also quite fast and
efficient when it comes to storing and retrieving ndarray objects, and since it
is based on a hierarchical database structure, many convenience features come on
top:


In [177]: %%time
          arr_int = h5.create_array('/', 'integers', ran_int) (1)
          arr_flo = h5.create_array('/', 'floats', ran_flo) (2)
          CPU times: user 4.26 ms, sys: 37.2 ms, total: 41.5 ms
          Wall time: 46.2 ms

In [178]: h5 (3)
Out[178]: File(filename=/Users/yves/Temp/data/pytab.h5, title='', mode='w',
           root_uep='/', filters=Filters(complevel=0, shuffle=False,
           bitshuffle=False, fletcher32=False, least_significant_digit=None))
          / (RootGroup) ''
          /floats (Array(2000000, 2)) ''
            atom := Float64Atom(shape=(), dflt=0.0)
            maindim := 0
            flavor := 'numpy'
            byteorder := 'little'
            chunkshape := None
          /integers (Array(2000000, 2)) ''
            atom := Int64Atom(shape=(), dflt=0)
            maindim := 0
            flavor := 'numpy'
            byteorder := 'little'
            chunkshape := None
          /ints_floats (Table(2000000,)) 'Integers and Floats'
            description := {
            "Date": StringCol(itemsize=26, shape=(), dflt=b'', pos=0),
            "No1": Int32Col(shape=(), dflt=0, pos=1),
            "No2": Int32Col(shape=(), dflt=0, pos=2),
            "No3": Float64Col(shape=(), dflt=0.0, pos=3),
            "No4": Float64Col(shape=(), dflt=0.0, pos=4)}
            byteorder := 'little'
            chunkshape := (2621,)

In [179]: ll $path*
          -rw-r--r-- 1 yves staff 262344490 Oct 19 12:12 /Users/yves/Temp/data/pytab.h5
          -rw-r--r-- 1 yves staff 41341436 Oct 19 12:12 /Users/yves/Temp/data/pytabc.h5

In [180]: h5.close()

In [181]: !rm -f $path*

(1) Stores the ran_int ndarray object.
(2) Stores the ran_flo ndarray object.
(3) The changes are reflected in the object description.


Writing these objects directly to an HDF5 database is faster than looping over
the objects and writing the data row-by-row to a Table object or using the
approach via structured ndarray objects.


HDF5-Based Data Storage


The HDF5 hierarchical database (file) format is a powerful alternative to, for
example, relational databases when it comes to structured numerical and
financial data. Both on a standalone basis when using PyTables directly and
when combining it with the capabilities of pandas, one can expect to get almost
the maximum I/O performance that the available hardware allows.


Out-of-Memory Computations 


PyTables supports out-of-memory operations, which makes it possible to
implement array-based computations that do not fit in memory. To this end,
consider the following code based on the EArray class. This type of object can
be expanded in one dimension (row-wise), while the number of columns (elements
per row) needs to be fixed:


In [182]: filename = path + 'earray.h5'

In [183]: h5 = tb.open_file(filename, 'w')

In [184]: n = 500 (1)

In [185]: ear = h5.create_earray('/', 'ear', (2)
                                 atom=tb.Float64Atom(), (3)
                                 shape=(0, n)) (4)

In [186]: type(ear)
Out[186]: tables.earray.EArray

In [187]: rand = np.random.standard_normal((n, n)) (5)
          rand[:4, :4]
Out[187]: array([[-1.25983231, 1.11420699, 0.1667485 , 0.7345676 ],
                 [-0.13785424, 1.22232417, 1.36303097, 0.13521042],
                 [ 1.45487119, -1.47784078, 0.15027672, 0.86755989],
                 [-0.63519366, 0.1516327 , -0.64939447, -0.45010975]])

In [188]: %%time
          for _ in range(750):
              ear.append(rand) (6)
          ear.flush()
          CPU times: user 814 ms, sys: 1.18 s, total: 1.99 s
          Wall time: 2.53 s

In [189]: ear
Out[189]: /ear (EArray(375000, 500)) ''
            atom := Float64Atom(shape=(), dflt=0.0)
            maindim := 0
            flavor := 'numpy'
            byteorder := 'little'
            chunkshape := (16, 500)

In [190]: ear.size_on_disk
Out[190]: 1500032000

(1) The fixed number of columns.
(2) The path and technical name of the EArray object.
(3) The atomic dtype object of the single values.
(4) The shape for instantiation (no rows, n columns).
(5) The ndarray object with the random numbers ...
(6) ... that gets appended many times.


For out-of-memory computations that do not lead to aggregations, another EArray
object of the same shape (size) is needed. PyTables has a special class to cope
with numerical expressions efficiently. It is called Expr and is based on the
numerical expression library numexpr. The code that follows uses Expr to
calculate the mathematical expression in Equation 9-1 on the whole EArray
object from before.


Equation 9-1. Example mathematical expression

y = 3 \sin(x) + \sqrt{|x|}


The results are stored in the out EArray object, and the expression evaluation
happens chunk-wise:


In [191]: out = h5.create_earray('/', 'out',
                                 atom=tb.Float64Atom(),
                                 shape=(0, n))

In [192]: out.size_on_disk
Out[192]: 0

In [193]: expr = tb.Expr('3 * sin(ear) + sqrt(abs(ear))') (1)

In [194]: expr.set_output(out, append_mode=True) (2)

In [195]: %time expr.eval() (3)
          CPU times: user 3.08 s, sys: 1.7 s, total: 4.78 s
          Wall time: 4.03 s
Out[195]: /out (EArray(375000, 500)) ''
            atom := Float64Atom(shape=(), dflt=0.0)
            maindim := 0
            flavor := 'numpy'
            byteorder := 'little'
            chunkshape := (16, 500)

In [196]: out.size_on_disk
Out[196]: 1500032000

In [197]: out[0, :10]
Out[197]: array([-1.73369462, 3.74824436, 0.90627898, 2.86786818, 1.75424957,
                 -0.91108973, -1.68313885, 1.29073295, -1.68665599, -1.71345309])

In [198]: %time out_ = out.read() (4)
          CPU times: user 1.03 s, sys: 1.1 s, total: 2.13 s
          Wall time: 2.22 s

In [199]: out_[0, :10]
Out[199]: array([-1.73369462, 3.74824436, 0.90627898, 2.86786818, 1.75424957,
                 -0.91108973, -1.68313885, 1.29073295, -1.68665599, -1.71345309])

(1) Transforms a str object-based expression to an Expr object.
(2) Defines the output to be the out EArray object.
(3) Initiates the evaluation of the expression.
(4) Reads the whole EArray into memory.
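
The idea of chunk-wise evaluation can be sketched without PyTables. The example below swaps in NumPy memory-mapped files (np.memmap) for the EArray objects and a plain Python loop for Expr, so it illustrates the principle only and is not how PyTables or numexpr implement it. The array sizes, the chunk length, and the file names are hypothetical values chosen to keep the example small.

```python
import os
import shutil
import tempfile
import numpy as np

n_rows, n_cols, chunk = 2000, 50, 256  # hypothetical sizes for illustration

tmpdir = tempfile.mkdtemp()
x = np.memmap(os.path.join(tmpdir, 'x.dat'), dtype='float64',
              mode='w+', shape=(n_rows, n_cols))
y = np.memmap(os.path.join(tmpdir, 'y.dat'), dtype='float64',
              mode='w+', shape=(n_rows, n_cols))
x[:] = np.random.default_rng(2).standard_normal((n_rows, n_cols))

# evaluate y = 3 sin(x) + sqrt(|x|) one block of rows at a time,
# so that only a single block has to be held in memory
for start in range(0, n_rows, chunk):
    block = np.asarray(x[start:start + chunk])
    y[start:start + chunk] = 3 * np.sin(block) + np.sqrt(np.abs(block))
y.flush()

# in-memory reference result for comparison
full = 3 * np.sin(np.asarray(x)) + np.sqrt(np.abs(np.asarray(x)))
max_diff = float(np.max(np.abs(np.asarray(y) - full)))
print(max_diff < 1e-12)

del x, y
shutil.rmtree(tmpdir, ignore_errors=True)
```

Because the expression is applied element by element, processing the rows in blocks gives the same result as evaluating everything at once; only the peak memory requirement changes.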


Given that the whole operation takes place out-of-memory, it can be considered
quite fast, in particular as it is executed on standard hardware. As a
benchmark, the in-memory performance of the numexpr module (see also
Chapter 10) can be considered. With four threads it is faster, but not by a
huge margin:


In [200]: import numexpr as ne (1)

In [201]: expr = '3 * sin(out_) + sqrt(abs(out_))' (2)

In [202]: ne.set_num_threads(1) (3)
Out[202]: 4

In [203]: %time ne.evaluate(expr)[0, :10] (4)
          CPU times: user 2.51 s, sys: 1.54 s, total: 4.05 s
          Wall time: 4.94 s
Out[203]: array([-1.64358578, 0.22567882, 3.31363043, 2.50443549, 4.27413965,
                 -1.41600606, -1.68373023, 4.01921805, -1.68117412, -1.66053597])

In [204]: ne.set_num_threads(4) (5)
Out[204]: 1

In [205]: %time ne.evaluate(expr)[0, :10] (6)
          CPU times: user 3.39 s, sys: 1.94 s, total: 5.32 s
          Wall time: 2.96 s
Out[205]: array([-1.64358578, 0.22567882, 3.31363043, 2.50443549, 4.27413965,
                 -1.41600606, -1.68373023, 4.01921805, -1.68117412, -1.66053597])

In [206]: h5.close()

In [207]: !rm -f $path*

(1) Imports the module for in-memory evaluations of numerical expressions.
(2) The numerical expression as a str object.
(3) Sets the number of threads to one.
(4) Evaluates the numerical expression in-memory with one thread.
(5) Sets the number of threads to four.
(6) Evaluates the numerical expression in-memory with four threads.


I/O with TsTables


The package TsTables uses PyTables to build a high-performance storage for time
series data. The major usage scenario is “write once, retrieve multiple times.”
This is a typical scenario in financial analytics, where data is created in the
markets, retrieved in real-time or asynchronously, and stored on disk for later
usage. That usage might be in a larger trading strategy backtesting program
that requires different subsets of a historical financial time series over and
over again. It is then important that data retrieval happens fast.


Sample Data 


As usual, the first task is the generation of a sample data set that is large enough to 
illustrate the benefits of TsTables. The following code generates three rather long 
financial time series based on the simulation of a geometric Brownian motion (see 
Chapter 12): 


In [208]: no = 5000000 (1)
          co = 3 (2)
          interval = 1. / (12 * 30 * 24 * 60) (3)
          vol = 0.2 (4)

In [209]: %%time
          rn = np.random.standard_normal((no, co)) (5)
          rn[0] = 0.0 (6)
          paths = 100 * np.exp(np.cumsum(-0.5 * vol ** 2 * interval +
                  vol * np.sqrt(interval) * rn, axis=0)) (7)
          paths[0] = 100 (8)
          CPU times: user 869 ms, sys: 175 ms, total: 1.04 s
          Wall time: 812 ms

(1) The number of time steps.
(2) The number of time series.
(3) The time interval as a year fraction.
(4) The volatility.
(5) Standard normally distributed random numbers.
(6) Sets the initial random numbers to zero.
(7) The simulation based on an Euler discretization.
(8) Sets the initial values of the paths to 100.


Since TsTables works pretty well with pandas DataFrame objects, the data is
transformed to such an object (see also Figure 9-7):


In [210]: dr = pd.date_range('2019-1-1', periods=no, freq='1s')

In [211]: dr[-6:]
Out[211]: DatetimeIndex(['2019-02-27 20:53:14', '2019-02-27 20:53:15',
                         '2019-02-27 20:53:16', '2019-02-27 20:53:17',
                         '2019-02-27 20:53:18', '2019-02-27 20:53:19'],
                        dtype='datetime64[ns]', freq='S')

In [212]: df = pd.DataFrame(paths, index=dr, columns=['ts1', 'ts2', 'ts3'])

In [213]: df.info()
          <class 'pandas.core.frame.DataFrame'>
          DatetimeIndex: 5000000 entries, 2019-01-01 00:00:00 to 2019-02-27 20:53:19
          Freq: S
          Data columns (total 3 columns):
          ts1 float64
          ts2 float64
          ts3 float64
          dtypes: float64(3)
          memory usage: 152.6 MB

In [214]: df.head()
Out[214]:                             ts1         ts2         ts3
          2019-01-01 00:00:00  100.000000  100.000000  100.000000
          2019-01-01 00:00:01  100.018443   99.966644   99.998255
          2019-01-01 00:00:02  100.069023  100.004420   99.986646
          2019-01-01 00:00:03  100.086757  100.000246   99.992042
          2019-01-01 00:00:04  100.105448  100.036033   99.950618

In [215]: df[::100000].plot(figsize=(10, 6));

Figure 9-7. Selected data points of the financial time series


Data Storage 


TsTables stores financial time series data based on a specific chunk-based structure 
that allows for fast retrieval of arbitrary data subsets defined by some time interval. 
To this end, the package adds the function create_ts() to PyTables. To provide
the data types for the table columns, the following uses a method based on the
tb.IsDescription class from PyTables:


In [216]: import tstables as tstab

In [217]: class ts_desc(tb.IsDescription):
              timestamp = tb.Int64Col(pos=0) (1)
              ts1 = tb.Float64Col(pos=1) (2)
              ts2 = tb.Float64Col(pos=2) (2)
              ts3 = tb.Float64Col(pos=3) (2)

In [218]: h5 = tb.open_file(path + 'tstab.h5', 'w') (3)

In [219]: ts = h5.create_ts('/', 'ts', ts_desc) (4)

In [220]: %time ts.append(df) (5)
          CPU times: user 1.36 s, sys: 497 ms, total: 1.86 s
          Wall time: 1.29 s

In [221]: type(ts)
Out[221]: tstables.tstable.TsTable

In [222]: ls -n $path
          total 328472
          -rw-r--r-- 1 501 20 157037368 Oct 19 12:13 tstab.h5

(1) The column for the timestamps.
(2) The columns to store the numerical data.
(3) Opens an HDF5 database file for writing (w).
(4) Creates the TsTable object based on the ts_desc object.
(5) Appends the data from the DataFrame object to the TsTable object.


Data Retrieval 


Writing data with TsTables obviously is quite fast, even if hardware-dependent. The 
same holds true for reading chunks of the data back into memory. Conveniently, 
TsTables returns a DataFrame object (see also Figure 9-8): 


In [223]: read_start_dt = dt.datetime(2019, 2, 1, 0, 0) (1)
          read_end_dt = dt.datetime(2019, 2, 5, 23, 59) (2)

In [224]: %time rows = ts.read_range(read_start_dt, read_end_dt) (3)
          CPU times: user 182 ms, sys: 73.5 ms, total: 255 ms
          Wall time: 163 ms

In [225]: rows.info() (4)
          <class 'pandas.core.frame.DataFrame'>
          DatetimeIndex: 431941 entries, 2019-02-01 00:00:00 to 2019-02-05 23:59:00
          Data columns (total 3 columns):
          ts1 431941 non-null float64
          ts2 431941 non-null float64
          ts3 431941 non-null float64
          dtypes: float64(3)
          memory usage: 13.2 MB

In [226]: rows.head() (4)
Out[226]:                            ts1        ts2         ts3
          2019-02-01 00:00:00  52.063640  40.474580  217.324713
          2019-02-01 00:00:01  52.087455  40.471911  217.250070
          2019-02-01 00:00:02  52.084808  40.458013  217.228712
          2019-02-01 00:00:03  52.073536  40.451408  217.302912
          2019-02-01 00:00:04  52.056133  40.450951  217.207481

In [227]: h5.close()

In [228]: (rows[::500] / rows.iloc[0]).plot(figsize=(10, 6));

(1) The start time of the interval.
(2) The end time of the interval.
(3) The method ts.read_range() returns a DataFrame object for the interval.
(4) The DataFrame object has a few hundred thousand data rows.


Figure 9-8. A specific time interval of the financial time series (normalized) 


To better illustrate the performance of the TsTables-based data retrieval,
consider the following benchmark, which retrieves 100 chunks of data, each
consisting of 4 days’ worth of 1-second bars (from 00:00:00 on the start day to
23:59:59 three days later). The retrieval of a DataFrame with 345,600 rows of
data takes less than one-tenth of a second:


In [229]: import random

In [230]: h5 = tb.open_file(path + 'tstab.h5', 'r')

In [231]: ts = h5.root.ts._f_get_timeseries() (1)

In [232]: %%time
          for _ in range(100): (2)
              d = random.randint(1, 24) (3)
              read_start_dt = dt.datetime(2019, 2, d, 0, 0, 0)
              read_end_dt = dt.datetime(2019, 2, d + 3, 23, 59, 59)
              rows = ts.read_range(read_start_dt, read_end_dt)
          CPU times: user 7.17 s, sys: 1.65 s, total: 8.81 s
          Wall time: 4.78 s

In [233]: rows.info() (4)
          <class 'pandas.core.frame.DataFrame'>
          DatetimeIndex: 345600 entries, 2019-02-04 00:00:00 to 2019-02-07 23:59:59
          Data columns (total 3 columns):
          ts1 345600 non-null float64
          ts2 345600 non-null float64
          ts3 345600 non-null float64
          dtypes: float64(3)
          memory usage: 10.5 MB

In [234]: !rm $path/tstab.h5

(1) This connects to the TsTable object.
(2) The data retrieval is repeated many times.
(3) The starting day value is randomized.
(4) The last DataFrame object is retrieved.
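
The row counts reported by rows.info() can be checked with a few lines of NumPy. The sketch below also hints at why range reads on sorted timestamps are cheap: with a sorted timestamp column, the first and last row of an interval can be located by binary search. This is an illustration only and makes no claim about how TsTables organizes or searches its chunks; the helper count_rows() is a hypothetical name introduced here.

```python
import numpy as np

no = 5000000
start = np.datetime64('2019-01-01T00:00:00', 's')
# sorted timestamps as integer seconds since the epoch, one per row
stamps = start.astype('int64') + np.arange(no, dtype='int64')

def count_rows(first, last):
    """Number of rows with first <= timestamp <= last (binary search)."""
    lo = np.searchsorted(stamps, np.datetime64(first, 's').astype('int64'),
                         side='left')
    hi = np.searchsorted(stamps, np.datetime64(last, 's').astype('int64'),
                         side='right')
    return int(hi - lo)

n_first = count_rows('2019-02-01T00:00:00', '2019-02-05T23:59:00')
n_bench = count_rows('2019-02-04T00:00:00', '2019-02-07T23:59:59')
print(n_first, n_bench)
# 431941 345600
```

The second number equals 4 × 86,400, i.e., four full days of 1-second bars, which matches the DataFrame retrieved last in the benchmark.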


Conclusion 


SQL-based or relational databases have advantages when it comes to complex data
structures that exhibit lots of relations between single objects/tables. This
might justify in some circumstances their performance disadvantage over pure
NumPy ndarray-based or pandas DataFrame-based approaches.


Many application areas in finance or science in general can succeed with a
mainly array-based data modeling approach. In these cases, huge performance
improvements can be realized by making use of native NumPy I/O capabilities, a
combination of NumPy and PyTables capabilities, or the pandas approach via
HDF5-based stores. TsTables is particularly useful when working with large
(financial) time series data sets, especially in “write once, retrieve multiple
times” scenarios.

While a recent trend has been to use cloud-based solutions (where the cloud is
made up of a large number of computing nodes based on commodity hardware), one
should carefully consider, especially in a financial context, which hardware
architecture best serves the analytics requirements. A study by Microsoft sheds
some light on this topic:

We claim that a single “scale-up” server can process each of these jobs and do
as well or better than a cluster in terms of performance, cost, power, and
server density.


—Appuswamy et al. (2013) 


Companies, research institutions, and others involved in data analytics should
therefore analyze first what specific tasks have to be accomplished in general
and then decide on the hardware/software architecture, in terms of:


Scaling out 
Using a cluster with many commodity nodes with standard CPUs and relatively 
low memory 


Scaling up 
Using one or a few powerful servers with many-core CPUs, possibly also GPUs 
or even TPUs when machine and deep learning play a role, and large amounts of 
memory 


Scaling up hardware and applying appropriate implementation approaches might 
significantly influence performance, which is the focus of the next chapter. 


Further Resources 


The paper cited at the beginning and end of the chapter is a good read, and a good 
starting point to think about hardware architecture for financial analytics: 


• Appuswamy, Raja, et al. (2013). “Nobody Ever Got Fired for Buying a Cluster”.
  Microsoft Technical Report.


As usual, the web provides many valuable resources with regard to the topics and 
Python packages covered in this chapter: 

• For serialization of Python objects with pickle, refer to the documentation.
• An overview of the I/O capabilities of NumPy is provided on the website.
• For I/O with pandas, see the respective section in the online documentation.
• The PyTables home page provides both tutorials and detailed documentation.
• More information on TsTables can be found on its GitHub page.

A friendly fork for TsTables is found at http://github.com/yhilpisch/tstables. Use pip
install git+git://github.com/yhilpisch/tstables to install the package from
this fork, which is maintained for compatibility with newer versions of pandas and
other Python packages.


CHAPTER 10
Performance Python 


Don’t lower your expectations to meet your performance. Raise your level of perfor- 
mance to meet your expectations. 


—Ralph Marston 


It is a long-lived prejudice that Python per se is a relatively slow programming lan- 
guage and not appropriate to implement computationally demanding tasks in 
finance. Beyond the fact that Python is an interpreted language, the reasoning is usu- 
ally along the following lines: Python is slow when it comes to loops; loops are often 
required to implement financial algorithms; therefore Python is too slow for financial 
algorithm implementation. Another line of reasoning is: other (compiled) program- 
ming languages are fast at executing loops (such as C or C++); loops are often 
required for financial algorithms; therefore these (compiled) programming languages 
are well suited for finance and financial algorithm implementation. 


Admittedly, it is possible to write proper Python code that executes rather slowly,
perhaps too slowly for many application areas. This chapter is about approaches to
speed up typical tasks and algorithms often encountered in a financial context. It
shows that with a judicious use of data structures, choosing the right implementation
idioms and paradigms, as well as using the right performance packages, Python is
able to compete even with compiled programming languages. This is due to, among
other factors, the Python code getting compiled itself.


To this end, this chapter introduces different approaches to speed up code: 


Vectorization 
Making use of Python’s vectorization capabilities is one approach already used 
extensively in previous chapters. 


Dynamic compiling
Using the Numba package allows one to dynamically compile pure Python code 
using LLVM technology. 


Static compiling 
Cython is not only a Python package but a hybrid language that combines Python 
and C; it allows one, for instance, to use static type declarations and to statically 
compile such adjusted code. 


Multiprocessing 
The multiprocessing module of Python allows for easy and simple paralleliza- 
tion of code execution. 


This chapter addresses the following topics: 


“Loops”
This section addresses Python loops and how to speed them up. 


“Algorithms”
This section is concerned with standard mathematical algorithms that are often 
used for performance benchmarks, such as Fibonacci number generation. 


“Binomial Trees”
The binomial option pricing model is a widely used financial model that allows 
for an interesting case study about a more involved financial algorithm. 


“Monte Carlo Simulation”
Similarly, Monte Carlo simulation is widely used in financial practice for pricing 
and risk management. It is computationally demanding and has long been con- 
sidered the domain of such languages as C or C++. 


“Recursive pandas Algorithm”
This section addresses the speedup of a recursive algorithm based on financial 
time series data. In particular, it presents different implementations for an algo- 
rithm calculating an exponentially weighted moving average (EWMA). 


Loops 


This section tackles the Python loop issue. The task is rather simple: a function shall 
be written that draws a certain “large” number of random numbers and returns the 
average of the values. The execution time is of interest, which can be estimated by the 
magic functions %time and %timeit. 
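
Outside of IPython or Jupyter, the magic functions %time and %timeit are not available. As a minimal sketch, comparable estimates can be obtained with the timeit module from the standard library. The names workload() and time_call() as well as the repeat counts are illustrative assumptions and not part of the chapter's code:

```python
import timeit


def workload(n):
    """Stand-in task: sum of the squares 0**2 + 1**2 + ... + (n - 1)**2."""
    return sum(i * i for i in range(n))


def time_call(func, *args, repeat=3, number=1):
    """Return the best wall-clock time in seconds per call of func(*args).

    Hypothetical helper: each measurement runs `number` calls, the
    measurement is repeated `repeat` times, and the fastest one is kept.
    """
    timings = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(timings) / number


if __name__ == "__main__":
    seconds = time_call(workload, 100_000)
    print(f"workload(100_000): {seconds * 1e3:.2f} ms per call")
```

Taking the minimum over several repetitions is one common convention; %timeit, by contrast, reports the mean and standard deviation over its runs.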


Python


Let’s get started “slowly” (forgive the pun). In pure Python, such a function might
look like average_py():


In [1]: import random

In [2]: def average_py(n):
            s = 0  # (1)
            for i in range(n):
                s += random.random()  # (2)
            return s / n  # (3)

In [3]: n = 10000000  # (4)

In [4]: %time average_py(n)  # (5)
        CPU times: user 1.82 s, sys: 10.4 ms, total: 1.83 s
        Wall time: 1.93 s
Out[4]: 0.5000590124747943

In [5]: %timeit average_py(n)  # (6)
        1.31 s ± 159 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %time sum([random.random() for _ in range(n)]) / n  # (7)
        CPU times: user 1.55 s, sys: 188 ms, total: 1.74 s
        Wall time: 1.74 s
Out[6]: 0.49987031710661173


(1) Initializes the variable value for s.
(2) Adds the uniformly distributed random values from the interval (0, 1) to s.
(3) Returns the average value (mean).
(4) Defines the number of iterations for the loop.
(5) Times the function once.
(6) Times the function multiple times for a more reliable estimate.
(7) Uses a list comprehension instead of the function.


This sets the benchmark for the other approaches to follow.


NumPy


The strength of NumPy lies in its vectorization capabilities. Formally, loops vanish on
the Python level; the looping takes place one level deeper based on optimized and
compiled routines provided by NumPy.¹ The function average_np() makes use of this
approach:


In [7]: import numpy as np

In [8]: def average_np(n):
            s = np.random.random(n)  # (1)
            return s.mean()  # (2)

In [9]: %time average_np(n)
        CPU times: user 180 ms, sys: 43.2 ms, total: 223 ms
        Wall time: 224 ms
Out[9]: 0.49988861556468317

In [10]: %timeit average_np(n)
         128 ms ± 2.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [11]: s = np.random.random(n)
         s.nbytes  # (3)
Out[11]: 80000000


(1) Draws the random numbers “all at once” (no Python loop).
(2) Returns the average value (mean).
(3) Number of bytes used for the created ndarray object.


The speedup is considerable, reaching almost a factor of 10 or an order of magnitude
(about 1.3 s versus 128 ms per loop). However, the price that must be paid is
significantly higher memory usage: the ndarray object with 10,000,000 values occupies
80,000,000 bytes. This is due to the fact that NumPy attains speed by preallocating
data that can be processed in the compiled layer. As a consequence, there is no way,
given this approach, to work with “streamed” data. This increased memory usage might
even be prohibitively large depending on the algorithm or problem at hand.


Vectorization and Memory 


It is tempting to write vectorized code with NumPy whenever possi- 
ble due to the concise syntax and speed improvements typically 
observed. However, these benefits often come at the price of a 
much higher memory footprint. 
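
If the full array does not fit comfortably into memory, one possible middle ground (not used in this chapter) is to vectorize over chunks of fixed size. The following is a minimal sketch under that assumption; the function name average_np_chunked() and the parameters chunk_size and seed are hypothetical:

```python
import numpy as np


def average_np_chunked(n, chunk_size=1_000_000, seed=None):
    """Average of n uniform random numbers, drawn chunk by chunk.

    Only one chunk of at most `chunk_size` float64 values (8 bytes each)
    is held in memory at any time.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    remaining = n
    while remaining > 0:
        size = min(chunk_size, remaining)
        total += rng.random(size).sum()  # vectorized work within the chunk
        remaining -= size
    return total / n


if __name__ == "__main__":
    print(average_np_chunked(10_000_000, seed=100))
```

With chunk_size=1_000_000, about 8 MB of random numbers are held at any time instead of the 80 MB reported above for n = 10,000,000, at the cost of a short Python loop over the chunks.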


1 NumPy can also make use of dedicated mathematics libraries, such as the Intel Math Kernel Library (MKL). 


Numba


Numba is a package that allows the dynamic compiling of pure Python code by the use 
of LLVM. The application in a simple case, like the one at hand, is surprisingly 
straightforward and the dynamically compiled function average_nb() can be called 
directly from Python: 


In [12]: import numba

In [13]: average_nb = numba.jit(average_py)  # (1)

In [14]: %time average_nb(n)  # (2)
         CPU times: user 204 ms, sys: 34.3 ms, total: 239 ms
         Wall time: 278 ms
Out[14]: 0.4998865391283664

In [15]: %time average_nb(n)  # (3)
         CPU times: user 80.9 ms, sys: 457 µs, total: 81.3 ms
         Wall time: 81.7 ms
Out[15]: 0.5001357454250273

In [16]: %timeit average_nb(n)  # (3)
         75.5 ms ± 1.95 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


(1) This creates the Numba function.
(2) The compiling happens during runtime, leading to some overhead.
(3) From the second execution (with the same input data types), the execution is
    faster.


The combination of pure Python with Numba beats the NumPy version and preserves 
the memory efficiency of the original loop-based implementation. It is also obvious 
that the application of Numba in such simple cases comes with hardly any program- 
ming overhead. 


No Free Lunch 


The application of Numba sometimes seems like magic when one 
compares the performance of the Python code to the compiled ver- 
sion, especially given its ease of use. However, there are many use 
cases for which Numba is not suited and for which performance 
gains are hardly observed or even impossible to achieve. 




Cython 


Cython allows one to statically compile Python code. However, the application is not
as simple as with Numba since the code generally needs to be changed to see
significant speed improvements. To begin with, consider the Cython function
average_cy1(), which introduces static type declarations for the used variables:


In [17]: %load_ext Cython

In [18]: %%cython -a
         import random  # (1)
         def average_cy1(int n):  # (2)
             cdef int i  # (2)
             cdef float s = 0  # (2)
             for i in range(n):
                 s += random.random()
             return s / n
Out[18]: <IPython.core.display.HTML object>

In [19]: %time average_cy1(n)
         CPU times: user 695 ms, sys: 4.31 ms, total: 699 ms
         Wall time: 711 ms
Out[19]: 0.49997106194496155

In [20]: %timeit average_cy1(n)
         752 ms ± 91.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


(1) Imports the random module within the Cython context.
(2) Adds static type declarations for the variables n, i, and s.


Some speedup is observed, but not even close to that achieved by, for example, the 
NumPy version. A bit more Cython optimization is necessary to beat even the Numba 
version: 


In [21]: %%cython
         from libc.stdlib cimport rand  # (1)
         cdef extern from 'limits.h':  # (2)
             int INT_MAX  # (2)
         cdef int i
         cdef float rn
         for i in range(5):
             rn = rand() / INT_MAX  # (3)
             print(rn)
         0.6792964339256287
         0.934692919254303
         0.3835020661354065
         0.5194163918495178
         0.8309653401374817

In [22]: %%cython -a
         from libc.stdlib cimport rand  # (1)
         cdef extern from 'limits.h':  # (2)
             int INT_MAX  # (2)
         def average_cy2(int n):
             cdef int i
             cdef float s = 0
             for i in range(n):
                 s += rand() / INT_MAX  # (3)
             return s / n
Out[22]: <IPython.core.display.HTML object>

In [23]: %time average_cy2(n)
         CPU times: user 78.5 ms, sys: 422 µs, total: 79 ms
         Wall time: 79.1 ms
Out[23]: 0.500017523765564

In [24]: %timeit average_cy2(n)
         65.4 ms ± 706 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


(1) Imports a random number generator from C.
(2) Imports a constant value for the scaling of the random numbers.
(3) Adds uniformly distributed random numbers from the interval (0, 1), after
    scaling.


This further optimized Cython version, average_cy2(), is now a bit faster than the 
Numba version. However, the effort has also been a bit larger. Compared to the NumPy 
version, Cython also preserves the memory efficiency of the original loop-based 
implementation. 


Cython = Python + C 


Cython allows developers to tweak code for performance as much 
as possible or as little as sensible—starting with a pure Python ver- 
sion, for instance, and adding more and more elements from C to 
the code. The compilation step itself can also be parameterized to 
further optimize the compiled version. 


Algorithms 


This section applies the performance-enhancing techniques from the previous sec- 
tion to some well-known problems and algorithms from mathematics. These algo- 
rithms are regularly used for performance benchmarks. 


Prime Numbers


Prime numbers play an important role not only in theoretical mathematics but also
in many applied computer science disciplines, such as encryption. A prime number is
a natural number greater than 1 that is divisible without remainder only by 1
and itself. There are no other factors. While it is difficult to find larger prime
numbers due to their rarity, it is easy to prove that a number is not prime. The only
thing that is needed is a factor other than 1 and the number itself that divides the
number without a remainder.


Python 


There are a number of algorithmic implementations available to test if numbers are 
prime. The following is a Python version that is not yet optimal from an algorithmic 
point of view but is already quite efficient. The execution time for the larger prime p2, 
however, is long: 


In [25]: def is_prime(I):
             if I % 2 == 0: return False  # (1)
             for i in range(3, int(I ** 0.5) + 1, 2):  # (2)
                 if I % i == 0: return False  # (3)
             return True  # (4)

In [26]: n = int(1e8 + 3)  # (5)
         n
Out[26]: 100000003

In [27]: %time is_prime(n)
         CPU times: user 35 µs, sys: 0 ns, total: 35 µs
         Wall time: 39.1 µs
Out[27]: False

In [28]: p1 = int(1e8 + 7)  # (5)
         p1
Out[28]: 100000007

In [29]: %time is_prime(p1)
         CPU times: user 776 µs, sys: 1 µs, total: 777 µs
         Wall time: 787 µs
Out[29]: True

In [30]: p2 = 100109100129162907  # (6)

In [31]: p2.bit_length()  # (6)
Out[31]: 57

In [32]: %time is_prime(p2)
         CPU times: user 22.6 s, sys: 44.7 ms, total: 22.6 s
         Wall time: 22.7 s
Out[32]: True


(1) If the number is even, False is returned immediately.
(2) The loop starts at 3 and goes until the square root of I plus 1 with step size 2.
(3) As soon as a factor is identified the function returns False.
(4) If no factor is found, True is returned.
(5) Relatively small non-prime and prime numbers.
(6) A larger prime number which requires longer execution times.
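
As noted, is_prime() is not yet optimal from an algorithmic point of view. One well-known refinement, sketched here as an illustration rather than as the chapter's method, also skips multiples of 3 by testing only candidates of the form 6k ± 1. The function name is_prime_6k() is hypothetical; math.isqrt() requires Python 3.8 or later:

```python
import math


def is_prime_6k(n):
    """Trial division that skips multiples of 2 and 3."""
    if n < 2:
        return False
    if n < 4:  # 2 and 3 are prime
        return True
    if n % 2 == 0 or n % 3 == 0:
        return False
    limit = math.isqrt(n)  # exact integer square root
    i = 5
    while i <= limit:
        # i runs over 6k - 1, and i + 2 over 6k + 1
        if n % i == 0 or n % (i + 2) == 0:
            return False
        i += 6
    return True


if __name__ == "__main__":
    print([k for k in range(30) if is_prime_6k(k)])
```

This reduces the number of trial divisions by roughly a third compared to testing all odd candidates. The sketch also handles the special cases 2 and 3 as well as inputs below 2.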


Numba 


The loop structure of the algorithm in the function is_prime() lends itself well to 
being dynamically compiled with Numba. The overhead again is minimal but the 
speedup considerable: 


In [33]: is_prime_nb = numba.jit(is_prime) 

In [34]: %time is_prime_nb(n)  # (1)
         CPU times: user 87.5 ms, sys: 7.91 ms, total: 95.4 ms
         Wall time: 93.7 ms
Out[34]: False

In [35]: %time is_prime_nb(n)  # (2)
         CPU times: user 9 µs, sys: 1e+03 ns, total: 10 µs
         Wall time: 13.6 µs
Out[35]: False

In [36]: %time is_prime_nb(p1)
         CPU times: user 26 µs, sys: 0 ns, total: 26 µs
         Wall time: 31 µs
Out[36]: True

In [37]: %time is_prime_nb(p2)  # (3)
         CPU times: user 1.72 s, sys: 9.7 ms, total: 1.73 s
         Wall time: 1.74 s
Out[37]: True


(1) The first call of is_prime_nb() involves the compiling overhead.
(2) From the second call, the speedup becomes fully visible.
(3) The speedup for the larger prime is about an order of magnitude.


Cython 


The application of Cython is straightforward as well. A plain Cython version without 
type declarations already speeds up the code significantly: 


In [38]: %%cython
         def is_prime_cy1(I):
             if I % 2 == 0: return False
             for i in range(3, int(I ** 0.5) + 1, 2):
                 if I % i == 0: return False
             return True

In [39]: %timeit is_prime(p1)
         394 µs ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [40]: %timeit is_prime_cy1(p1)
         243 µs ± 6.58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


However, real improvements only materialize with the static type declarations. The
Cython version is then even slightly faster than the Numba one:


In [41]: %%cython
         def is_prime_cy2(long I):  # (1)
             cdef long i  # (1)
             if I % 2 == 0: return False
             for i in range(3, int(I ** 0.5) + 1, 2):
                 if I % i == 0: return False
             return True

In [42]: %timeit is_prime_cy2(p1)
         87.6 µs ± 27.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [43]: %time is_prime_nb(p2)
         CPU times: user 1.68 s, sys: 9.73 ms, total: 1.69 s
         Wall time: 1.7 s
Out[43]: True

In [44]: %time is_prime_cy2(p2)
         CPU times: user 1.66 s, sys: 9.47 ms, total: 1.67 s
         Wall time: 1.68 s
Out[44]: True


(1) Static type declarations for the two variables I and i.


Multiprocessing


So far, all the optimization efforts have focused on the sequential code execution. In 
particular with prime numbers, there might be a need to check multiple numbers at 
the same time. To this end, the multiprocessing module can help speed up the code 
execution further. It allows one to spawn multiple Python processes that run in paral- 
lel. The application is straightforward in the simple case at hand. First, an mp.Pool 
object is set up with multiple processes. Second, the function to be executed is map- 
ped to the prime numbers to be checked: 


In [45]: import multiprocessing as mp

In [46]: pool = mp.Pool(processes=4)  # (1)

In [47]: %time pool.map(is_prime, 10 * [p1])  # (2)
         CPU times: user 1.52 ms, sys: 2.09 ms, total: 3.61 ms
         Wall time: 9.73 ms
Out[47]: [True, True, True, True, True, True, True, True, True, True]

In [48]: %time pool.map(is_prime_nb, 10 * [p2])  # (2)
         CPU times: user 13.9 ms, sys: 4.8 ms, total: 18.7 ms
         Wall time: 10.4 s
Out[48]: [True, True, True, True, True, True, True, True, True, True]

In [49]: %time pool.map(is_prime_cy2, 10 * [p2])  # (2)
         CPU times: user 9.8 ms, sys: 3.22 ms, total: 13 ms
         Wall time: 9.51 s
Out[49]: [True, True, True, True, True, True, True, True, True, True]


(1) The mp.Pool object is instantiated with multiple processes.
(2) Then the respective function is mapped to a list object with prime numbers.


The observed speedup is significant. The Python function is_prime() takes more
than 20 seconds for the larger prime number p2. Both the is_prime_nb() and the
is_prime_cy2() functions take roughly 10 seconds for 10 times the prime number
p2 when executed in parallel with four processes.


Parallel Processing 


Parallel processing should be considered whenever different prob- 
lems of the same type need to be solved. The effect can be huge 
when powerful hardware is available with many cores and suffi- 
cient working memory. multiprocessing is one easy-to-use mod- 
ule from the standard library. 


Fibonacci Numbers


Fibonacci numbers and sequences can be derived based on a simple algorithm. Start 
with two ones: 1, 1. From the third number, the next Fibonacci number is derived as 
the sum of the two preceding ones: 1, 1, 2, 3, 5, 8, 13, 21, .... This section analyzes two 
different implementations, a recursive one and an iterative one. 


Recursive algorithm 


Similar to regular Python loops, it is known that regular recursive function
implementations are relatively slow with Python. Such functions call themselves
potentially a large number of times to come up with the final result. The function
fib_rec_py1() presents such an implementation. In this case, Numba helps only a
little with speeding up the execution (by less than a factor of two in the run shown).
However, Cython shows significant speedups based on static type declarations only:


In [50]: def fib_rec_py1(n):
             if n < 2:
                 return n
             else:
                 return fib_rec_py1(n - 1) + fib_rec_py1(n - 2)

In [51]: %time fib_rec_py1(35)
         CPU times: user 6.55 s, sys: 29 ms, total: 6.58 s
         Wall time: 6.6 s
Out[51]: 9227465

In [52]: fib_rec_nb = numba.jit(fib_rec_py1)

In [53]: %time fib_rec_nb(35)
         CPU times: user 3.87 s, sys: 24.2 ms, total: 3.9 s
         Wall time: 3.91 s
Out[53]: 9227465

In [54]: %%cython
         def fib_rec_cy(int n):
             if n < 2:
                 return n
             else:
                 return fib_rec_cy(n - 1) + fib_rec_cy(n - 2)

In [55]: %time fib_rec_cy(35)
         CPU times: user 751 ms, sys: 4.37 ms, total: 756 ms
         Wall time: 755 ms
Out[55]: 9227465
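
The slow timings of the plain recursion can be traced back to the number of function calls. The following sketch counts them and contrasts this with a cached variant based on functools.lru_cache from the standard library; the function names fib_calls() and fib_cached() are illustrative and not taken from the chapter:

```python
from functools import lru_cache


def fib_calls(n):
    """Return (fib(n), number of calls the plain recursion needs)."""
    if n < 2:
        return n, 1
    a, calls_a = fib_calls(n - 1)
    b, calls_b = fib_calls(n - 2)
    return a + b, calls_a + calls_b + 1  # +1 for the current call


@lru_cache(maxsize=None)
def fib_cached(n):
    """Same recursion, but each distinct n is evaluated only once."""
    if n < 2:
        return n
    return fib_cached(n - 1) + fib_cached(n - 2)


if __name__ == "__main__":
    value, calls = fib_calls(20)
    print(value, calls)
    print(fib_cached(20), fib_cached.cache_info().misses)
```

For n = 20, the plain recursion needs 21,891 calls, while the cached variant evaluates the function body only 21 times (once for each n from 0 to 20). The call count grows by a factor of about 1.6 per increment of n, which gives about 30 million calls for n = 35.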


The major problem with the recursive algorithm is that intermediate results are not
cached but rather recalculated. To avoid this particular problem, a decorator can be
used that takes care of the caching of intermediate results. This speeds up the
execution by multiple orders of magnitude:


In [56]: from functools import lru_cache as cache

In [57]: @cache(maxsize=None)  # (1)
         def fib_rec_py2(n):
             if n < 2:
                 return n
             else:
                 return fib_rec_py2(n - 1) + fib_rec_py2(n - 2)

In [58]: %time fib_rec_py2(35)  # (2)
         CPU times: user 64 µs, sys: 28 µs, total: 92 µs
         Wall time: 98 µs
Out[58]: 9227465

In [59]: %time fib_rec_py2(80)  # (2)
         CPU times: user 38 µs, sys: 8 µs, total: 46 µs
         Wall time: 51 µs
Out[59]: 23416728348467685


(1) Caching intermediate results ...
(2) ... leads to tremendous speedups in this case.


Iterative algorithm


Although the algorithm to calculate the nth Fibonacci number can be implemented 
recursively, it doesn’t have to be. The following presents an iterative implementation 
which is even in pure Python faster than the cached variant of the recursive 
implementation. This is also the terrain where Numba leads to further improvements. 
However, the Cython version comes out as the winner: 


In [60]: def fib_it_py(n):
             x, y = 0, 1
             for i in range(1, n + 1):
                 x, y = y, x + y
             return x

In [61]: %time fib_it_py(80)
         CPU times: user 19 µs, sys: 1e+03 ns, total: 20 µs
         Wall time: 26 µs
Out[61]: 23416728348467685

In [62]: fib_it_nb = numba.jit(fib_it_py)

In [63]: %time fib_it_nb(80)
         CPU times: user 57 ms, sys: 6.9 ms, total: 63.9 ms
         Wall time: 62 ms
Out[63]: 23416728348467685

In [64]: %time fib_it_nb(80)
         CPU times: user 7 µs, sys: 1 µs, total: 8 µs
         Wall time: 12.2 µs
Out[64]: 23416728348467685

In [65]: %%cython
         def fib_it_cy1(int n):
             cdef long i
             cdef long x = 0, y = 1
             for i in range(1, n + 1):
                 x, y = y, x + y
             return x

In [66]: %time fib_it_cy1(80)
         CPU times: user 4 µs, sys: 1e+03 ns, total: 5 µs
         Wall time: 11 µs
Out[66]: 23416728348467685


Now that everything is so fast, one might wonder why we're just calculating the 80th
Fibonacci number and not the 150th, for instance. The problem is with the available
data types. While Python can basically handle arbitrarily large numbers (see “Basic
Data Types” on page 62), this is not true in general for the compiled languages. With
Cython one can, however, rely on a special data type to allow for numbers larger than
a 64-bit integer allows for:


In [67]: %%time
         fn = fib_rec_py2(150)  # (1)
         print(fn)  # (1)
         9969216677189303386214405760200
         CPU times: user 361 µs, sys: 115 µs, total: 476 µs
         Wall time: 430 µs

In [68]: fn.bit_length()  # (2)
Out[68]: 103

In [69]: %%time
         fn = fib_it_nb(150)  # (3)
         print(fn)  # (3)
         6792540214324356296
         CPU times: user 270 µs, sys: 78 µs, total: 348 µs
         Wall time: 297 µs

In [70]: fn.bit_length()  # (4)
Out[70]: 63

In [71]: %%time
         fn = fib_it_cy1(150)  # (3)
         print(fn)  # (3)
         6792540214324356296
         CPU times: user 255 µs, sys: 71 µs, total: 326 µs
         Wall time: 279 µs

In [72]: fn.bit_length()  # (4)
Out[72]: 63

In [73]: %%cython
         cdef extern from *:
             ctypedef int int128 '__int128_t'  # (5)
         def fib_it_cy2(int n):
             cdef int128 i  # (5)
             cdef int128 x = 0, y = 1  # (5)
             for i in range(1, n + 1):
                 x, y = y, x + y
             return x

In [74]: %%time
         fn = fib_it_cy2(150)  # (6)
         print(fn)  # (6)
         9969216677189303386214405760200
         CPU times: user 280 µs, sys: 115 µs, total: 395 µs
         Wall time: 328 µs

In [75]: fn.bit_length()  # (6)
Out[75]: 103


(1) The Python version is fast and correct.
(2) The resulting integer has a bit length of 103 (> 64).
(3) The Numba and Cython versions are faster but incorrect.
(4) They suffer from an overflow issue due to the restriction to 64-bit int objects.
(5) Imports the special 128-bit int object type and uses it.
(6) The Cython version fib_it_cy2() now is faster and correct.
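
The incorrect results of the Numba and Cython versions can be mimicked in pure Python by wrapping every addition to 64 bits. The following sketch assumes two's-complement wraparound of signed 64-bit integers, which is the behavior typically observed in practice (although signed overflow is formally undefined in C); the helper names to_int64(), fib_it_int64(), and fib_it_exact() are hypothetical:

```python
def to_int64(value):
    """Interpret an integer as a signed 64-bit two's-complement value."""
    value &= (1 << 64) - 1  # keep the lowest 64 bits
    if value >= 1 << 63:    # top bit set means a negative number
        value -= 1 << 64
    return value


def fib_it_int64(n):
    """Iterative Fibonacci with every addition wrapped to 64 bits."""
    x, y = 0, 1
    for _ in range(n):
        x, y = y, to_int64(x + y)
    return x


def fib_it_exact(n):
    """Iterative Fibonacci with Python's arbitrary-precision integers."""
    x, y = 0, 1
    for _ in range(n):
        x, y = y, x + y
    return x


if __name__ == "__main__":
    for n in (80, 92, 93, 150):
        print(n, fib_it_exact(n), fib_it_int64(n))
```

Up to n = 92 both functions agree, since the 92nd Fibonacci number still fits into a signed 64-bit integer. From n = 93 onward the wrapped version deviates from the exact one.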


The Number Pi


The final algorithm analyzed in this section is a Monte Carlo simulation-based
algorithm to derive digits for the number pi ($\pi$).² The basic idea relies on the
fact that the area $A$ of a circle is given by $A = \pi r^2$. Therefore,
$\pi = \frac{A}{r^2}$. For a unit circle with radius $r = 1$, it holds that $\pi = A$.
The idea of the algorithm is to simulate random points with coordinate values
$(x, y)$, with $x, y \in [-1, 1]$. The area of an origin-centered square with side
length of 2 is exactly 4. The area of the origin-centered unit circle is a fraction
of the area of such a square. This fraction can be estimated by Monte Carlo
simulation: count all the points in the square, then count all the points in the
circle, and divide the number of points in the circle by the number of points in the
square. The following example demonstrates (see Figure 10-1):


In [76]: import random
         import numpy as np
         from pylab import mpl, plt
         plt.style.use('seaborn')
         mpl.rcParams['font.family'] = 'serif'
         %matplotlib inline

In [77]: rn = [(random.random() * 2 - 1, random.random() * 2 - 1)
               for _ in range(500)]

In [78]: rn = np.array(rn)
         rn[:5]
Out[78]: array([[ 0.45583018, -0.27676067],
                [-0.70120038,  0.15196888],
                [ 0.07224045,  0.90147321],
                [-0.17450337, -0.47660912],
                [ 0.94896746, -0.31511879]])

In [79]: fig = plt.figure(figsize=(7, 7))
         ax = fig.add_subplot(1, 1, 1)
         circ = plt.Circle((0, 0), radius=1, edgecolor='g', lw=2.0,
                           facecolor='None')  # (1)
         box = plt.Rectangle((-1, -1), 2, 2, edgecolor='b', alpha=0.3)  # (2)
         ax.add_patch(circ)  # (1)
         ax.add_patch(box)  # (2)
         plt.plot(rn[:, 0], rn[:, 1], 'r.')  # (3)
         plt.ylim(-1.1, 1.1)
         plt.xlim(-1.1, 1.1)


(1) Draws the unit circle.
(2) Draws the square with side length of 2.
(3) Draws the uniformly distributed random dots.


Figure 10-1. Unit circle and square with side length 2 with uniformly distributed
random points


2 The examples are inspired by a post on Code Review Stack Exchange.


A NumPy implementation of this algorithm is rather concise but also memory- 
intensive. Total execution time given the parameterization is about one second: 


In [80]: n = int(1e7)

In [81]: %time rn = np.random.random((n, 2)) * 2 - 1
         CPU times: user 450 ms, sys: 87.9 ms, total: 538 ms
         Wall time: 573 ms

In [82]: rn.nbytes
Out[82]: 160000000

In [83]: %time distance = np.sqrt((rn ** 2).sum(axis=1))  # (1)
         distance[:8].round(3)
         CPU times: user 537 ms, sys: 198 ms, total: 736 ms
         Wall time: 651 ms
Out[83]: array([1.181, 1.061, 0.669, 1.206, 0.799, 0.579, 0.694, 0.941])

In [84]: %time frac = (distance <= 1.0).sum() / len(distance)  # (2)
         CPU times: user 47.9 ms, sys: 6.77 ms, total: 54.7 ms
         Wall time: 28 ms

In [85]: pi_mcs = frac * 4  # (3)
         pi_mcs  # (3)
Out[85]: 3.1413396


(1) The distance of the points from the origin (Euclidean norm).
(2) The fraction of those points within the circle relative to all points.
(3) This accounts for the square area of 4 for the estimation of the circle area and
    therewith of π.


mcs_pi_py() is a Python function using a for loop and implementing the Monte 
Carlo simulation in a memory-efficient manner. Note that the random numbers are 
not scaled in this case. The execution time is longer than with the NumPy version, but 
the Numba version is faster than NumPy in this case: 


In [86]: def mcs_pi_py(n):
             circle = 0
             for _ in range(n):
                 x, y = random.random(), random.random()
                 if (x ** 2 + y ** 2) ** 0.5 <= 1:
                     circle += 1
             return (4 * circle) / n

In [87]: %time mcs_pi_py(n)
         CPU times: user 5.47 s, sys: 23 ms, total: 5.49 s
         Wall time: 5.43 s
Out[87]: 3.1418964

In [88]: mcs_pi_nb = numba.jit(mcs_pi_py)

In [89]: %time mcs_pi_nb(n)
         CPU times: user 319 ms, sys: 6.36 ms, total: 326 ms
         Wall time: 326 ms
Out[89]: 3.1422012

In [90]: %time mcs_pi_nb(n)
         CPU times: user 284 ms, sys: 3.92 ms, total: 288 ms
         Wall time: 291 ms
Out[90]: 3.142066
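
The Python function mcs_pi_py() draws the coordinates from the interval [0, 1) without scaling. This still works because the points then cover only the upper-right quadrant: the square has area 1, the quarter circle has area π/4, and the fraction of points inside the circle is the same as for the full square. A minimal sketch comparing both variants follows; the function names and the use of a seeded random.Random instance are assumptions made for reproducibility. The square root is omitted since comparing the squared distance to 1 is equivalent:

```python
import math
import random


def mcs_pi_full_square(n, rng):
    """Points in [-1, 1) x [-1, 1): square area 4, circle area pi."""
    circle = 0
    for _ in range(n):
        x, y = rng.random() * 2 - 1, rng.random() * 2 - 1
        if x * x + y * y <= 1:
            circle += 1
    return 4 * circle / n


def mcs_pi_quarter(n, rng):
    """Points in [0, 1) x [0, 1): square area 1, quarter circle area pi / 4."""
    circle = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1:
            circle += 1
    return 4 * circle / n


if __name__ == "__main__":
    rng = random.Random(1000)
    print(mcs_pi_full_square(200_000, rng), mcs_pi_quarter(200_000, rng), math.pi)
```

Both estimators multiply the observed fraction by 4, so their results differ only by sampling noise.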


A plain Cython version with static type declarations only does not perform that much 
faster than the Python version. However, relying again on the random number gener- 
ation capabilities of C further speeds up the calculation considerably: 


In [91]: %%cython -a
         import random
         def mcs_pi_cy1(int n):
             cdef int i, circle = 0
             cdef float x, y
             for i in range(n):
                 x, y = random.random(), random.random()
                 if (x ** 2 + y ** 2) ** 0.5 <= 1:
                     circle += 1
             return (4 * circle) / n
Out[91]: <IPython.core.display.HTML object>

In [92]: %time mcs_pi_cy1(n)
         CPU times: user 1.15 s, sys: 8.24 ms, total: 1.16 s
         Wall time: 1.16 s
Out[92]: 3.1417132

In [93]: %%cython -a
         from libc.stdlib cimport rand
         cdef extern from 'limits.h':
             int INT_MAX
         def mcs_pi_cy2(int n):
             cdef int i, circle = 0
             cdef float x, y
             for i in range(n):
                 x, y = rand() / INT_MAX, rand() / INT_MAX
                 if (x ** 2 + y ** 2) ** 0.5 <= 1:
                     circle += 1
             return (4 * circle) / n
Out[93]: <IPython.core.display.HTML object>

In [94]: %time mcs_pi_cy2(n)
         CPU times: user 170 ms, sys: 1.45 ms, total: 172 ms
         Wall time: 172 ms
Out[94]: 3.1419388


Algorithm Types 


The algorithms analyzed in this section might not be directly 
related to financial algorithms. However, the advantage is that they 
are simple and easy to understand. In addition, typical algorithmic 
problems encountered in a financial context can be discussed 
within this simplified context. 


Binomial Trees


A popular numerical method to value options is the binomial option pricing model 
pioneered by Cox, Ross, and Rubinstein (1979). This method relies on representing 
the possible future evolution of an asset by a (recombining) tree. In this model, as in 
the Black-Scholes-Merton (1973) setup, there is a risky asset, an index or stock, and a 
riskless asset, a bond. The relevant time interval from today until the maturity of the
option is divided in general into equidistant subintervals of length $\Delta t$. Given
an index level at time $s$ of $S_s$, the index level at $t = s + \Delta t$ is given by
$S_t = S_s \cdot m$, where $m$ is chosen randomly from $\{u, d\}$ with
$0 < d < e^{r \Delta t} < u = e^{\sigma \sqrt{\Delta t}}$ as well as
$u = \frac{1}{d}$. $r$ is the constant, riskless short rate.
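
To make the conditions on u and d concrete, the following sketch evaluates them for one set of parameter values (the same values that the implementation below works with) and four time intervals. It is an illustration only and uses nothing beyond the math module:

```python
import math

# Parameter values as used in this section's example
S0 = 36.0     # initial index level
T = 1.0       # time horizon in years
r = 0.06      # constant short rate
sigma = 0.2   # constant volatility

M = 4                                 # number of time intervals
dt = T / M                            # length of one interval
u = math.exp(sigma * math.sqrt(dt))   # upward factor
d = 1 / u                             # downward factor
growth = math.exp(r * dt)             # riskless growth over one interval

# The ordering 0 < d < exp(r * dt) < u from the text must hold
assert 0 < d < growth < u

print(f"u = {u:.4f}, d = {d:.4f}, exp(r * dt) = {growth:.4f}")
print(f"index level after one up move:   {S0 * u:.2f}")
print(f"index level after one down move: {S0 * d:.2f}")
```

Because u = 1/d, an upward move followed by a downward move leads back to the starting level, which is what makes the tree recombine.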


Python 


The code that follows presents a Python implementation that creates a recombining 
tree based on some fixed numerical parameters for the model: 


In [95]: import math

In [96]: S0 = 36.  # (1)
         T = 1.0  # (2)
         r = 0.06  # (3)
         sigma = 0.2  # (4)

In [97]: def simulate_tree(M):
             dt = T / M  # (5)
             u = math.exp(sigma * math.sqrt(dt))  # (6)
             d = 1 / u  # (6)
             S = np.zeros((M + 1, M + 1))
             S[0, 0] = S0
             z = 1
             for t in range(1, M + 1):
                 for i in range(z):
                     S[i, t] = S[i, t-1] * u
                     S[i+1, t] = S[i, t-1] * d
                 z += 1
             return S


(1) Initial value of the risky asset.
(2) Time horizon for the binomial tree simulation.
(3) Constant short rate.
(4) Constant volatility factor.
(5) Length of the time intervals.
(6) Factors for the upward and downward movements.


Contrary to what happens in typical tree plots, an upward movement is represented 
in the ndarray object as a sideways movement, which decreases the ndarray size 
considerably: 


In [98]: np.set_printoptions(formatter={'float':
                                        lambda x: '%6.2f' % x})

In [99]: simulate_tree(4)  # (1)
Out[99]: array([[ 36.00,  39.79,  43.97,  48.59,  53.71],
                [  0.00,  32.57,  36.00,  39.79,  43.97],
                [  0.00,   0.00,  29.47,  32.57,  36.00],
                [  0.00,   0.00,   0.00,  26.67,  29.47],
                [  0.00,   0.00,   0.00,   0.00,  24.13]])

In [100]: %time simulate_tree(500)  # (2)
          CPU times: user 148 ms, sys: 4.49 ms, total: 152 ms
          Wall time: 154 ms
Out[100]: array([[ 36.00,  36.32,  36.65, ..., 3095.69, 3123.50, 3151.57],
                 [  0.00,  35.68,  36.00, ..., 3040.81, 3068.13, 3095.69],
                 [  0.00,   0.00,  35.36, ..., 2986.89, 3013.73, 3040.81],
                 ...,
                 [  0.00,   0.00,   0.00, ...,   0.42,   0.42,   0.43],
                 [  0.00,   0.00,   0.00, ...,   0.00,   0.41,   0.42],
                 [  0.00,   0.00,   0.00, ...,   0.00,   0.00,   0.41]])


(1) Tree with 4 time intervals.
(2) Tree with 500 time intervals.


NumPy 


With some trickery, such a binomial tree can be created with NumPy based on fully 
vectorized code: 


In [101]: M = 4

In [102]: up = np.arange(M + 1)
          up = np.resize(up, (M + 1, M + 1))  # (1)
          up
Out[102]: array([[0, 1, 2, 3, 4],
                 [0, 1, 2, 3, 4],
                 [0, 1, 2, 3, 4],
                 [0, 1, 2, 3, 4],
                 [0, 1, 2, 3, 4]])

In [103]: down = up.T * 2  # (2)
          down
Out[103]: array([[0, 0, 0, 0, 0],
                 [2, 2, 2, 2, 2],
                 [4, 4, 4, 4, 4],
                 [6, 6, 6, 6, 6],
                 [8, 8, 8, 8, 8]])

In [104]: up - down  # (3)
Out[104]: array([[ 0,  1,  2,  3,  4],
                 [-2, -1,  0,  1,  2],
                 [-4, -3, -2, -1,  0],
                 [-6, -5, -4, -3, -2],
                 [-8, -7, -6, -5, -4]])

In [105]: dt = T / M

In [106]: S0 * np.exp(sigma * math.sqrt(dt) * (up - down))  # (4)
Out[106]: array([[ 36.00, 39.79, 43.97, 48.59, 53.71],
                 [ 29.47, 32.57, 36.00, 39.79, 43.97],
                 [ 24.13, 26.67, 29.47, 32.57, 36.00],
                 [ 19.76, 21.84, 24.13, 26.67, 29.47],
                 [ 16.18, 17.88, 19.76, 21.84, 24.13]])


(1) ndarray object with gross upward movements.
(2) ndarray object with gross downward movements.
(3) ndarray object with net upward (positive) and downward (negative) movements.
(4) Tree for four time intervals (upper-right triangle of values).
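The vectorized array also carries values below the diagonal that do not belong to the tree. As a hedged sketch (the function name `simulate_tree_masked` is hypothetical, and the parameters are repeated from the text so that the block runs on its own), `np.triu()` can zero out those entries so that the result has the same layout as the loop-based version:

```python
import math

import numpy as np

S0, T, sigma = 36.0, 1.0, 0.2  # parameter values taken from the text


def simulate_tree_masked(M):
    """Vectorized binomial tree with all entries below the diagonal set to zero."""
    dt = T / M
    up = np.resize(np.arange(M + 1), (M + 1, M + 1))  # gross upward movements
    down = up.T * 2                                   # gross downward movements
    S = S0 * np.exp(sigma * math.sqrt(dt) * (up - down))
    return np.triu(S)  # keep only the upper-right triangle (the actual tree)


tree = simulate_tree_masked(4)
print(tree.round(2))
```

Entry `[i, t]` then holds the index level after `t` steps with `i` downward movements, and every entry with `i > t` is zero.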


In the NumPy case, the code is a bit more compact. However, more importantly, NumPy
vectorization achieves a speedup of an order of magnitude while not using more
memory:


In [107]: def simulate_tree_np(M):
              dt = T / M
              up = np.arange(M + 1)
              up = np.resize(up, (M + 1, M + 1))
              down = up.transpose() * 2
              S = S0 * np.exp(sigma * math.sqrt(dt) * (up - down))
              return S

In [108]: simulate_tree_np(4)
Out[108]: array([[ 36.00, 39.79, 43.97, 48.59, 53.71],
                 [ 29.47, 32.57, 36.00, 39.79, 43.97],
                 [ 24.13, 26.67, 29.47, 32.57, 36.00],
                 [ 19.76, 21.84, 24.13, 26.67, 29.47],
                 [ 16.18, 17.88, 19.76, 21.84, 24.13]])

In [109]: %time simulate_tree_np(500)
          CPU times: user 8.72 ms, sys: 7.07 ms, total: 15.8 ms
          Wall time: 12.9 ms

Out[109]: array([[ 36.00, 36.32, 36.65, ..., 3095.69, 3123.50, 3151.57],
                 [ 35.36, 35.68, 36.00, ..., 3040.81, 3068.13, 3095.69],
                 [ 34.73, 35.05, 35.36, ..., 2986.89, 3013.73, 3040.81],
                 ...,
                 [ 0.00, 0.00, 0.00, ..., 0.42, 0.42, 0.43],
                 [ 0.00, 0.00, 0.00, ..., 0.41, 0.41, 0.42],
                 [ 0.00, 0.00, 0.00, ..., 0.40, 0.41, 0.41]])


Numba 


This financial algorithm should be well suited to optimization through Numba 
dynamic compilation. And indeed, another speedup compared to the NumPy version 
of an order of magnitude is observed. This makes the Numba version orders of magni- 
tude faster than the Python (or rather hybrid) version: 


In [110]: simulate_tree_nb = numba.jit(simulate_tree)

In [111]: simulate_tree_nb(4)
Out[111]: array([[ 36.00, 39.79, 43.97, 48.59, 53.71],
                 [ 0.00, 32.57, 36.00, 39.79, 43.97],
                 [ 0.00, 0.00, 29.47, 32.57, 36.00],
                 [ 0.00, 0.00, 0.00, 26.67, 29.47],
                 [ 0.00, 0.00, 0.00, 0.00, 24.13]])

In [112]: %time simulate_tree_nb(500)
          CPU times: user 425 µs, sys: 193 µs, total: 618 µs
          Wall time: 625 µs

Out[112]: array([[ 36.00, 36.32, 36.65, ..., 3095.69, 3123.50, 3151.57],
                 [ 0.00, 35.68, 36.00, ..., 3040.81, 3068.13, 3095.69],
                 [ 0.00, 0.00, 35.36, ..., 2986.89, 3013.73, 3040.81],
                 ...,
                 [ 0.00, 0.00, 0.00, ..., 0.42, 0.42, 0.43],
                 [ 0.00, 0.00, 0.00, ..., 0.00, 0.41, 0.42],
                 [ 0.00, 0.00, 0.00, ..., 0.00, 0.00, 0.41]])

In [113]: %timeit simulate_tree_nb(500)
          559 µs ± 46.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Cython 


As before, Cython requires more adjustments to the code to see significant improve- 
ments. The following version uses mainly static type declarations and certain imports 
that improve the performance compared to the regular Python imports and func- 
tions, respectively: 

In [114]: %%cython -a
          import numpy as np
          cimport cython
          from libc.math cimport exp, sqrt
          cdef float S0 = 36.
          cdef float T = 1.0
          cdef float r = 0.06
          cdef float sigma = 0.2
          def simulate_tree_cy(int M):
              cdef int z, t, i
              cdef float dt, u, d
              cdef float[:, :] S = np.zeros((M + 1, M + 1),
                                            dtype=np.float32)  # (1)
              dt = T / M
              u = exp(sigma * sqrt(dt))
              d = 1 / u
              S[0, 0] = S0
              z = 1
              for t in range(1, M + 1):
                  for i in range(z):
                      S[i, t] = S[i, t-1] * u
                      S[i+1, t] = S[i, t-1] * d
                  z += 1
              return np.array(S)
Out[114]: <IPython.core.display.HTML object>


(1) Declaring the ndarray object to be a C array is critical for performance.


The Cython version shaves off another 30% of the execution time compared to the
Numba version:


In [115]: simulate_tree_cy(4)
Out[115]: array([[ 36.00, 39.79, 43.97, 48.59, 53.71],
                 [ 0.00, 32.57, 36.00, 39.79, 43.97],
                 [ 0.00, 0.00, 29.47, 32.57, 36.00],
                 [ 0.00, 0.00, 0.00, 26.67, 29.47],
                 [ 0.00, 0.00, 0.00, 0.00, 24.13]], dtype=float32)

In [116]: %time simulate_tree_cy(500)
          CPU times: user 2.21 ms, sys: 1.89 ms, total: 4.1 ms
          Wall time: 2.45 ms

Out[116]: array([[ 36.00, 36.32, 36.65, ..., 3095.77, 3123.59, 3151.65],
                 [ 0.00, 35.68, 36.00, ..., 3040.89, 3068.21, 3095.77],
                 [ 0.00, 0.00, 35.36, ..., 2986.97, 3013.81, 3040.89],
                 ...,
                 [ 0.00, 0.00, 0.00, ..., 0.42, 0.42, 0.43],
                 [ 0.00, 0.00, 0.00, ..., 0.00, 0.41, 0.42],
                 [ 0.00, 0.00, 0.00, ..., 0.00, 0.00, 0.41]], dtype=float32)

In [117]: %timeit S = simulate_tree_cy(500)
          363 µs ± 29.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Monte Carlo Simulation


Monte Carlo simulation is an indispensable numerical tool in computational finance. 
It has been in use since long before the advent of modern computers. Banks and 
other financial institutions use it, among others, for pricing and risk management 
purposes. As a numerical method it is perhaps the most flexible and powerful one in 
finance. However, it often also is the most computationally demanding one. That is 
why Python was long dismissed as a proper programming language to implement 
algorithms based on Monte Carlo simulation—at least for real-world application 
scenarios. 


This section analyzes the Monte Carlo simulation of the geometric Brownian motion,
a simple yet still widely used stochastic process to model the evolution of stock prices
or index levels. Among others, the Black-Scholes-Merton (1973) theory of option
pricing draws on this process. In their setup the underlying of the option to be valued
follows the stochastic differential equation (SDE), as seen in Equation 10-1. $S_t$ is the
value of the underlying at time $t$; $r$ is the constant, riskless short rate; $\sigma$ is the constant
instantaneous volatility; and $Z_t$ is a Brownian motion.


Equation 10-1. Black-Scholes-Merton SDE (geometric Brownian motion)

$$dS_t = r S_t \, dt + \sigma S_t \, dZ_t$$


This SDE can be discretized over equidistant time intervals and simulated according
to Equation 10-2, which represents an Euler scheme. In this case, $z$ is a standard
normally distributed random number. For $M$ time intervals, the length of the time
interval is given as $\Delta t = \frac{T}{M}$, where $T$ is the time horizon for the simulation (for
example, the maturity date of an option to be valued).


Equation 10-2. Black-Scholes-Merton difference equation (Euler scheme)

$$S_t = S_{t - \Delta t} \exp\left(\left(r - \frac{\sigma^2}{2}\right) \Delta t + \sigma \sqrt{\Delta t} \, z\right)$$


The Monte Carlo estimator for a European call option is then given by Equation
10-3, where $S_T(i)$ is the $i$th simulated value of the underlying at maturity $T$ for a total
number of simulated paths $I$ with $i = 1, 2, \ldots, I$.


Equation 10-3. Monte Carlo estimator for European call option

$$C_0 = e^{-rT} \frac{1}{I} \sum_{i=1}^{I} \max(S_T(i) - K, 0)$$

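Equations 10-2 and 10-3 can be combined in a few lines of plain Python. The following is a minimal sketch rather than the implementation used in this chapter: the function name `mc_european_value`, the `kind` switch for the put payoff, and the fixed seed are assumptions for illustration, and a single time step over the whole horizon is used because only the value at maturity enters the estimator. The strike of 40 is the one used in the example that follows.

```python
import math
import random


def mc_european_value(S0, K, r, sigma, T, I, kind="call", seed=None):
    """Monte Carlo estimator in the spirit of Equation 10-3.

    Applies one step of Equation 10-2 over [0, T]; kind="put" swaps the payoff.
    """
    rng = random.Random(seed)
    drift = (r - sigma ** 2 / 2) * T
    vol = sigma * math.sqrt(T)
    payoff_sum = 0.0
    for _ in range(I):
        # Simulated value of the underlying at maturity for one path.
        ST = S0 * math.exp(drift + vol * rng.gauss(0.0, 1.0))
        if kind == "call":
            payoff_sum += max(ST - K, 0.0)
        else:
            payoff_sum += max(K - ST, 0.0)
    # Discounted average payoff over all simulated paths.
    return math.exp(-r * T) * payoff_sum / I


call_value = mc_european_value(36.0, 40.0, 0.06, 0.2, 1.0, I=20000, seed=42)
put_value = mc_european_value(36.0, 40.0, 0.06, 0.2, 1.0, I=20000,
                              kind="put", seed=42)
print(call_value, put_value)
```

A pure Python loop like this is slow for large `I`, which is exactly the performance issue the following subsections address.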
Python


First, a Python—or rather a hybrid—version, mcs_simulation_py(), that implements 
the Monte Carlo simulation according to Equation 10-2. It is hybrid since it imple- 
ments Python loops on ndarray objects. As seen previously, this might make for a 
good basis to dynamically compile the code with Numba. As before, the execution time 
sets the benchmark. Based on the simulation, a European put option is valued: 


In [118]: M = 100  # (1)
          I = 50000  # (2)

In [119]: def mcs_simulation_py(p):
              M, I = p
              dt = T / M
              S = np.zeros((M + 1, I))
              S[0] = S0
              rn = np.random.standard_normal(S.shape)  # (3)
              for t in range(1, M + 1):  # (4)
                  for i in range(I):  # (4)
                      S[t, i] = S[t-1, i] * math.exp((r - sigma ** 2 / 2) * dt +
                                            sigma * math.sqrt(dt) * rn[t, i])  # (4)
              return S

In [120]: %time S = mcs_simulation_py((M, I))
          CPU times: user 5.55 s, sys: 52.9 ms, total: 5.6 s
          Wall time: 5.62 s

In [121]: S[-1].mean()  # (5)
Out[121]: 38.22291254503985

In [122]: S0 * math.exp(r * T)  # (6)
Out[122]: 38.22611567563295

In [123]: K = 40.  # (7)

In [124]: C0 = math.exp(-r * T) * np.maximum(K - S[-1], 0).mean()  # (8)

In [125]: C0  # (8)
Out[125]: 3.860545188088036


(1) The number of time intervals for discretization.
(2) The number of paths to be simulated.
(3) The random numbers, drawn in a single vectorized step.
(4) The nested loop implementing the simulation based on the Euler scheme.
(5) The mean end-of-period value based on the simulation.
(6) The theoretically expected end-of-period value.
(7) The strike price of the European put option.
(8) The Monte Carlo estimator for the option.
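A closed-form benchmark helps to judge the quality of the Monte Carlo estimate. The sketch below uses the standard Black-Scholes-Merton formula for a European put option, which this section does not spell out; the helper names `norm_cdf` and `bsm_put_value` are hypothetical, and the parameter values are those of the example:

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bsm_put_value(S0, K, r, sigma, T):
    """Black-Scholes-Merton value of a European put option."""
    d1 = (math.log(S0 / K) + (r + sigma ** 2 / 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return K * math.exp(-r * T) * norm_cdf(-d2) - S0 * norm_cdf(-d1)


P0 = bsm_put_value(S0=36.0, K=40.0, r=0.06, sigma=0.2, T=1.0)
print(round(P0, 4))
```

The analytical value comes out at roughly 3.84, so the simulated estimate above is within a few cents of it.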


Figure 10-2 shows a histogram of the simulated values at the end of the simulation 
period (maturity of the European put option). 

Figure 10-2. Frequency distribution of the simulated end-of-period values


NumPy 


The NumPy version, mcs_simulation_np(), is not too different. It still has one Python 
loop, namely over the time intervals. The other dimension is handled by vectorized 
code over all paths. It is about 20 times faster than the first version: 


In [127]: def mcs_simulation_np(p):
              M, I = p
              dt = T / M
              S = np.zeros((M + 1, I))
              S[0] = S0
              rn = np.random.standard_normal(S.shape)
              for t in range(1, M + 1):  # (1)
                  S[t] = S[t-1] * np.exp((r - sigma ** 2 / 2) * dt +
                                         sigma * math.sqrt(dt) * rn[t])  # (2)
              return S

In [128]: %time S = mcs_simulation_np((M, I))
          CPU times: user 252 ms, sys: 32.9 ms, total: 285 ms
          Wall time: 252 ms

In [129]: S[-1].mean()
Out[129]: 38.235136032258595

In [130]: %timeit S = mcs_simulation_np((M, I))
          202 ms ± 27.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


(1) The loop over the time intervals.
(2) The Euler scheme with vectorized NumPy code handling all paths at once.
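The remaining loop over the time intervals can also be removed, because a product of exponentials equals the exponential of a sum. The following sketch is an alternative that is not part of the original comparison: the function name `mcs_simulation_vec`, the keyword defaults, and the use of `np.random.default_rng()` with a seed are assumptions made to keep the block self-contained.

```python
import math

import numpy as np


def mcs_simulation_vec(p, S0=36.0, T=1.0, r=0.06, sigma=0.2, seed=None):
    """Simulate geometric Brownian motion paths without any Python loop."""
    M, I = p
    dt = T / M
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((M, I))
    # Log-increments of Equation 10-2 for every time step and every path.
    increments = (r - sigma ** 2 / 2) * dt + sigma * math.sqrt(dt) * z
    # Cumulative sums over time give the log-returns since t = 0;
    # a row of zeros is prepended for the initial value.
    log_paths = np.vstack([np.zeros((1, I)), np.cumsum(increments, axis=0)])
    return S0 * np.exp(log_paths)


S = mcs_simulation_vec((50, 20000), seed=1000)
print(S.shape, S[-1].mean())
```

This variant holds several arrays of the full size in memory at once, so it trades a larger memory footprint for the removal of the loop.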


Numba 


It should not come as a surprise anymore that Numba is applied to such an algorithm
type easily, and with significant performance improvements over the pure loop version.
The Numba version, mcs_simulation_nb(), runs at roughly the same speed as the NumPy
version (248 ms versus 202 ms per loop in the timings that follow):


In [131]: mcs_simulation_nb = numba.jit(mcs_simulation_py)

In [132]: %time S = mcs_simulation_nb((M, I))  # (1)
          CPU times: user 673 ms, sys: 36.7 ms, total: 709 ms
          Wall time: 764 ms

In [133]: %time S = mcs_simulation_nb((M, I))  # (2)
          CPU times: user 239 ms, sys: 20.8 ms, total: 259 ms
          Wall time: 265 ms

In [134]: S[-1].mean()
Out[134]: 38.22350694016539

In [135]: C0 = math.exp(-r * T) * np.maximum(K - S[-1], 0).mean()

In [136]: C0
Out[136]: 3.8303077438193833

In [137]: %timeit S = mcs_simulation_nb((M, I))  # (2)
          248 ms ± 20.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


(1) First call with compile-time overhead.
(2) Second call without that overhead.


Cython 


With Cython, again not surprisingly, the effort required to speed up the code is
higher. However, the speedup itself is not greater. The Cython version,
mcs_simulation_cy(), seems to be even a bit slower compared to the NumPy and Numba
versions. Among other factors, some time is needed to transform the simulation results
to an ndarray object:


In [138]: %%cython
          import numpy as np
          cimport numpy as np
          cimport cython
          from libc.math cimport exp, sqrt
          cdef float S0 = 36.
          cdef float T = 1.0
          cdef float r = 0.06
          cdef float sigma = 0.2
          @cython.boundscheck(False)
          @cython.wraparound(False)
          def mcs_simulation_cy(p):
              cdef int M, I
              M, I = p
              cdef int t, i
              cdef float dt = T / M
              cdef double[:, :] S = np.zeros((M + 1, I))
              cdef double[:, :] rn = np.random.standard_normal((M + 1, I))
              S[0] = S0
              for t in range(1, M + 1):
                  for i in range(I):
                      S[t, i] = S[t-1, i] * exp((r - sigma ** 2 / 2) * dt +
                                                   sigma * sqrt(dt) * rn[t, i])
              return np.array(S)

In [139]: %time S = mcs_simulation_cy((M, I))
          CPU times: user 237 ms, sys: 65.2 ms, total: 302 ms
          Wall time: 271 ms

In [140]: S[-1].mean()
Out[140]: 38.241735841791574

In [141]: %timeit S = mcs_simulation_cy((M, I))
          221 ms ± 9.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Multiprocessing 


Monte Carlo simulation is a task that lends itself well to parallelization. One 
approach would be to parallelize the simulation of 100,000 paths, say, into 10 pro- 
cesses simulating 10,000 paths each. Another would be to parallelize the simulation 
of the 100,000 paths into multiple processes, each simulating a different financial 
instrument, for example. The former case—namely, the parallel simulation of a larger 
number of paths based on a fixed number of separate processes—is illustrated in 
what follows. 


The following code again makes use of the multiprocessing module. It divides the
total number of paths to be simulated $I$ into smaller chunks of size $\frac{I}{p}$ with $p > 0$. After
all the single tasks are finished, the results are put together in a single ndarray object
via np.hstack(). This approach can be applied to any of the versions presented
previously. For the particular parameterization chosen here, there is no speedup to be
observed through this parallelization approach:


In [142]: import multiprocessing as mp

In [143]: pool = mp.Pool(processes=4)  # (1)

In [144]: p = 20  # (2)

In [145]: %timeit S = np.hstack(pool.map(mcs_simulation_np,
                                         p * [(M, int(I / p))]))
          288 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [146]: %timeit S = np.hstack(pool.map(mcs_simulation_nb,
                                         p * [(M, int(I / p))]))
          258 ms ± 8.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [147]: %timeit S = np.hstack(pool.map(mcs_simulation_cy,
                                         p * [(M, int(I / p))]))
          274 ms ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


(1) The Pool object for parallelization.
(2) The number of chunks into which the simulation is divided.
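The expression `int(I / p)` silently drops paths whenever `I` is not divisible by `p`. With the values used here (50,000 paths, 20 chunks) nothing is lost, but a small helper avoids the issue in general. This is a sketch under that assumption; the name `chunk_sizes` is hypothetical:

```python
def chunk_sizes(I, p):
    """Split I paths into p chunk sizes that sum exactly to I."""
    if p <= 0:
        raise ValueError("p must be positive")
    base, remainder = divmod(I, p)
    # The first `remainder` chunks each take one extra path.
    return [base + 1 if k < remainder else base for k in range(p)]


sizes = chunk_sizes(50000, 20)   # evenly divisible case from the text
uneven = chunk_sizes(50000, 7)   # case in which int(I / p) would lose paths
print(sizes[:3], uneven)
```

The list could then replace `p * [(M, int(I / p))]` in the calls above, for instance as `[(M, n) for n in chunk_sizes(I, p)]`.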


Multiprocessing Strategies 


In finance, there are many algorithms that are useful for paralleli- 
zation. Some of these even allow the application of different strate- 
gies to parallelize the code. Monte Carlo simulation is a good 
example in that multiple simulations can easily be executed in par- 
allel, either on a single machine or on multiple machines, and that 
the algorithm itself allows a single simulation to be distributed over 
multiple processes. 


Recursive pandas Algorithm 


This section addresses a somewhat special topic which is, however, an important one 
in financial analytics: the implementation of recursive functions on financial time 
series data stored in a pandas DataFrame object. While pandas allows for sophistica- 
ted vectorized operations on DataFrame objects, certain recursive algorithms are hard 
or impossible to vectorize, leaving the financial analyst with slowly executed Python 
loops on DataFrame objects. The examples that follow implement what is called the
exponentially weighted moving average (EWMA) in a simple form.


The EWMA for a financial time series $S_t$, $t \in \{0, \ldots, T\}$, is given by Equation 10-4.


Equation 10-4. Exponentially weighted moving average (EWMA)

$$EWMA_0 = S_0$$
$$EWMA_t = \alpha \cdot S_t + (1 - \alpha) \cdot EWMA_{t-1}, \quad t \in \{1, \ldots, T\}$$


Although simple in nature and straightforward to implement, such an algorithm 
might lead to rather slow code. 
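To make the recursion in Equation 10-4 concrete, here is a minimal sketch on a plain Python list. The function name `ewma` is hypothetical; the sample prices are the first five SPY values that appear in the data.head() output of this section, and alpha is set to the 0.25 used there:

```python
def ewma(prices, alpha):
    """EWMA recursion of Equation 10-4 on a plain Python sequence."""
    result = [prices[0]]  # EWMA_0 = S_0
    for price in prices[1:]:
        # EWMA_t = alpha * S_t + (1 - alpha) * EWMA_{t-1}
        result.append(alpha * price + (1 - alpha) * result[-1])
    return result


spy = [113.33, 113.63, 113.71, 114.19, 114.57]
smoothed = ewma(spy, alpha=0.25)
print([round(v, 6) for v in smoothed])
```

Each value depends on the previous result, which is why the calculation cannot simply be expressed as one elementwise array operation.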


Python 


Consider first the Python version that iterates over the DatetimeIndex of a
DataFrame object containing financial time series data for a single financial instrument
(see Chapter 8). Figure 10-3 visualizes the financial time series and the EWMA time
series:


In [148]: import pandas as pd

In [149]: sym = 'SPY'

In [150]: data = pd.DataFrame(pd.read_csv('../../source/tr_eikon_eod_data.csv',
                              index_col=0, parse_dates=True)[sym]).dropna()

In [151]: alpha = 0.25

In [152]: data['EWMA'] = data[sym]  # (1)

In [153]: %%time
          for t in zip(data.index, data.index[1:]):
              data.loc[t[1], 'EWMA'] = (alpha * data.loc[t[1], sym] +
                                        (1 - alpha) * data.loc[t[0], 'EWMA'])  # (2)
          CPU times: user 588 ms, sys: 16.4 ms, total: 605 ms
          Wall time: 591 ms

In [154]: data.head()
Out[154]:                SPY        EWMA
          Date
          2010-01-04  113.33  113.330000
          2010-01-05  113.63  113.405000
          2010-01-06  113.71  113.481250
          2010-01-07  114.19  113.658438
          2010-01-08  114.57  113.886328

In [155]: data[data.index > '2017-1-1'].plot(figsize=(10, 6));


(1) Initializes the EWMA column.
(2) Implements the algorithm based on a Python loop.


Figure 10-3. Financial time series with EWMA


Now consider the more general Python function ewma_py(). It can be applied directly to
the column or to the raw financial time series data in the form of an ndarray object:


In [156]: def ewma_py(x, alpha): 
y = np.zeros_like(x) 
y[0] = x[0] 
for i in range(1, len(x)): 
y[i] = alpha * x[i] + (1-alpha) * y[i-1] 
return y 


In [157]: %time data['EWMA_PY'] = ewma_py(data[sym], alpha)  # (1)
          CPU times: user 33.1 ms, sys: 1.22 ms, total: 34.3 ms
          Wall time: 33.9 ms

In [158]: %time data['EWMA_PY'] = ewma_py(data[sym].values, alpha)  # (2)
          CPU times: user 1.61 ms, sys: 44 µs, total: 1.65 ms
          Wall time: 1.62 ms


(1) Applies the function to the Series object directly (i.e., the column).
(2) Applies the function to the ndarray object containing the raw data.


This approach already speeds up the code execution considerably, by a factor of
about 20 to more than 100.


Numba


The very structure of this algorithm promises further speedups when applying Numba. 
And indeed, when the function ewma_nb() is applied to the ndarray version of the 
data, the speedup is again by an order of magnitude: 


In [159]: ewma_nb = numba.jit(ewma_py)

In [160]: %time data['EWMA_NB'] = ewma_nb(data[sym], alpha)  # (1)
          CPU times: user 269 ms, sys: 11.4 ms, total: 280 ms
          Wall time: 294 ms

In [161]: %timeit data['EWMA_NB'] = ewma_nb(data[sym], alpha)  # (1)
          30.9 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [162]: %time data['EWMA_NB'] = ewma_nb(data[sym].values, alpha)  # (2)
          CPU times: user 94.1 ms, sys: 3.78 ms, total: 97.9 ms
          Wall time: 97.6 ms

In [163]: %timeit data['EWMA_NB'] = ewma_nb(data[sym].values, alpha)  # (2)
          134 µs ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


(1) Applies the function to the Series object directly (i.e., the column).
(2) Applies the function to the ndarray object containing the raw data.


Cython 


The Cython version, ewma_cy(), also achieves considerable speed improvements but 
it is not as fast as the Numba version in this case: 


In [164]: %%cython
          import numpy as np
          cimport cython
          @cython.boundscheck(False)
          @cython.wraparound(False)
          def ewma_cy(double[:] x, float alpha):
              cdef int i
              cdef double[:] y = np.empty_like(x)
              y[0] = x[0]
              for i in range(1, len(x)):
                  y[i] = alpha * x[i] + (1 - alpha) * y[i - 1]
              return y

In [165]: %time data['EWMA_CY'] = ewma_cy(data[sym].values, alpha)
          CPU times: user 2.98 ms, sys: 1.41 ms, total: 4.4 ms
          Wall time: 5.96 ms

In [166]: %timeit data['EWMA_CY'] = ewma_cy(data[sym].values, alpha)
          1.29 ms ± 194 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


This final example illustrates again that there are in general multiple options to
implement (nonstandard) algorithms. All options might lead to exactly the same
results, while also showing considerably different performance characteristics. The
execution times in this example range from 0.1 ms to 500 ms, a factor of 5,000.
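For this particular recursion, pandas also ships a built-in that is not part of the comparison above. As a hedged sketch, `Series.ewm()` with `adjust=False` applies the same recursion as Equation 10-4; the helper `ewma_loop` and the five sample prices (the first SPY values from the data.head() output) are included only to check that both approaches agree:

```python
import numpy as np
import pandas as pd


def ewma_loop(x, alpha):
    """Reference implementation of the EWMA recursion with a Python loop."""
    y = np.zeros_like(x, dtype=float)
    y[0] = x[0]
    for i in range(1, len(x)):
        y[i] = alpha * x[i] + (1 - alpha) * y[i - 1]
    return y


prices = pd.Series([113.33, 113.63, 113.71, 114.19, 114.57])

# adjust=False selects the recursive form y_t = alpha * x_t + (1 - alpha) * y_{t-1}.
ewma_builtin = prices.ewm(alpha=0.25, adjust=False).mean()
ewma_manual = ewma_loop(prices.values, 0.25)
print(np.allclose(ewma_builtin.values, ewma_manual))
```

Whether the built-in is faster than the Numba version depends on the data size and would need its own benchmark; no timing is claimed here.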


Best Versus First-Best 


It is easy in general to translate algorithms to the Python program- 
ming language. However, it is equally easy to implement algo- 
rithms in a way that is unnecessarily slow given the menu of 
performance options available. For interactive financial analytics, a 
first-best solution—i.e., one that does the trick but which might not 
be the fastest possible nor the most memory-efficient one—might 
be perfectly fine. For financial applications in production, one 
should strive to implement the best solution, even though this 
might involve a bit more research and some formal benchmarking. 


Conclusion 


The Python ecosystem provides a number of ways to improve the performance of 
code: 


Idioms and paradigms 
Some Python paradigms and idioms might be more performant than others, 
given a specific problem; in many cases, for instance, vectorization is a paradigm 
that not only leads to more concise code but also to higher speeds (sometimes at 
the cost of a larger memory footprint). 


Packages 
There is a wealth of packages available for different types of problems, and
using a package adapted to the problem can often lead to much higher perfor- 
mance; good examples are NumPy with the ndarray class and pandas with the 
DataFrame class. 


Compiling 
Powerful packages for speeding up financial algorithms are Numba and Cython for 
the dynamic and static compilation of Python code. 


Parallelization 
Some Python packages, such as multiprocessing, allow for the easy paralleliza- 
tion of Python code; the examples in this chapter only use parallelization on a 
single machine but the Python ecosystem also offers technologies for multi- 
machine (cluster) parallelization. 

A major benefit of the performance approaches presented in this chapter is that they
are in general easily implementable, meaning that the additional effort required is
usually low. In other words, performance improvements often are low-hanging
fruit given the performance packages available as of today.


Further Resources 


For all the performance packages introduced in this chapter, there are helpful web 
resources available: 


• http://cython.org is the home of the Cython package and compiler project.

• The documentation for the multiprocessing module is found at
  https://docs.python.org/3/library/multiprocessing.html.

• Information on Numba can be found at http://github.com/numba/numba and
  https://numba.pydata.org.


For references in book form, see the following:

• Gorelick, Misha, and Ian Ozsvald (2014). High Performance Python. Sebastopol,
  CA: O’Reilly.

• Smith, Kurt (2015). Cython. Sebastopol, CA: O’Reilly.


Original papers cited in this chapter:

• Black, Fischer, and Myron Scholes (1973). “The Pricing of Options and Corporate
  Liabilities.” Journal of Political Economy, Vol. 81, No. 3, pp. 638-659.

• Cox, John, Stephen Ross, and Mark Rubinstein (1979). “Option Pricing: A Simplified
  Approach.” Journal of Financial Economics, Vol. 7, No. 3, pp. 229-263.

• Merton, Robert (1973). “Theory of Rational Option Pricing.” Bell Journal of
  Economics and Management Science, Vol. 4, pp. 141-183.


CHAPTER 11
Mathematical Tools 


The mathematicians are the priests of the modern world. 


—Bill Gaede 


Since the arrival of the so-called Rocket Scientists on Wall Street in the 1980s and 
1990s, finance has evolved into a discipline of applied mathematics. While early 
research papers in finance came with lots of text and few mathematical expressions 
and equations, current ones are mainly comprised of mathematical expressions and 
equations with some explanatory text around. 


This chapter introduces some useful mathematical tools for finance, without provid- 
ing a detailed background for each of them. There are many useful books available on 
this topic, so this chapter focuses on how to use the tools and techniques with 
Python. In particular, it covers: 


“Approximation”
Regression and interpolation are among the most often used numerical techniques
in finance.


“Convex Optimization”
A number of financial disciplines need tools for convex optimization (for
instance, derivatives analytics when it comes to model calibration).


“Integration”
In particular, the valuation of financial (derivative) assets often boils down to the
evaluation of integrals.


“Symbolic Computation”
With SymPy, Python provides a powerful package for symbolic mathematics, for
example, to solve (systems of) equations.


Approximation


To begin with, the usual imports: 


In [1]: import numpy as np 
from pylab import plt, mpl 


In [2]: plt.style.use('seaborn') 
mpl.rcParams['font.family'] = 'serif' 
%matplotlib inline 


Throughout this section, the main example function is the following, which is com- 
prised of a trigonometric term and a linear term: 
In [3]: def f(x): 
return np.sin(x) + 0.5 * x 

The main focus is the approximation of this function over a given interval by
regression and interpolation techniques. First, a plot of the function to get a better view of
what exactly the approximation shall achieve. The interval of interest shall be $[-2\pi, 2\pi]$.
Figure 11-1 displays the function over the fixed interval defined via the
np.linspace() function. The function create_plot() is a helper function to create the
same type of plot required multiple times in this chapter:


In [4]: def create_plot(x, y, styles, labels, axlabels):
            plt.figure(figsize=(10, 6))
            for i in range(len(x)):
                plt.plot(x[i], y[i], styles[i], label=labels[i])
            plt.xlabel(axlabels[0])
            plt.ylabel(axlabels[1])
            plt.legend(loc=0)

In [5]: x = np.linspace(-2 * np.pi, 2 * np.pi, 50)  # (1)

In [6]: create_plot([x], [f(x)], ['b'], ['f(x)'], ['x', 'f(x)'])


(1) The x values used for the plotting and the calculations.


Figure 11-1. Example function plot


Regression 


Regression is a rather efficient tool when it comes to function approximation. It is
not only suited to approximating one-dimensional functions but also works well in
higher dimensions. The numerical techniques needed to come up with regression
results are easily implemented and quickly executed. Basically, the task of regression,
given a set of so-called basis functions $b_d$, $d \in \{1, \ldots, D\}$, is to find optimal
parameters $\alpha_1^*, \ldots, \alpha_D^*$ according to Equation 11-1, where $y_i \equiv f(x_i)$ for $i \in \{1, \ldots, I\}$
observation points. The $x_i$ are considered independent observations and the $y_i$
dependent observations (in a functional or statistical sense).


Equation 11-1. Minimization problem of regression

$$\min_{\alpha_1, \ldots, \alpha_D} \frac{1}{I} \sum_{i=1}^{I} \left( y_i - \sum_{d=1}^{D} \alpha_d \cdot b_d(x_i) \right)^2$$


Monomials as basis functions 


One of the simplest cases is to take monomials as basis functions, i.e.,
$b_1 = 1, b_2 = x, b_3 = x^2, b_4 = x^3, \ldots$. In such a case, NumPy has built-in functions for
both the determination of the optimal parameters (namely, np.polyfit()) and the
evaluation of the approximation given a set of input values (namely, np.polyval()).


Table 11-1 lists the parameters the np.polyfit() function takes. Given the returned
optimal regression coefficients p from np.polyfit(), np.polyval(p, x) then
returns the regression values for the x coordinates.


Table 11-1. Parameters of polyfit() function

Parameter  Description
x          x coordinates (independent variable values)
y          y coordinates (dependent variable values)
deg        Degree of the fitting polynomial
full       If True, returns diagnostic information in addition
w          Weights to apply to the y coordinates
cov        If True, returns covariance matrix in addition
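As a quick illustration of the two functions, the following sketch fits a polynomial of degree 2 to data that is an exact quadratic. The sample data is hypothetical and is chosen only so that the recovered coefficients are known in advance:

```python
import numpy as np

# Hypothetical sample data: an exact quadratic without noise.
x = np.linspace(-1.0, 1.0, 11)
y = 2.0 * x ** 2 - 3.0 * x + 1.0

p = np.polyfit(x, y, deg=2)   # optimal coefficients, highest power first
y_hat = np.polyval(p, x)      # regression values at the x coordinates

print(p.round(6))
```

Note that np.polyfit() returns the coefficient of the highest-order monomial first, which is also the order np.polyval() expects.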


In typical vectorized fashion, the application of np.polyfit() and np.polyval() 
takes on the following form for a linear regression (i.e., for deg=1). Given the regres- 
sion estimates stored in the ry array, we can compare the regression result with the 
original function as presented in Figure 11-2. Of course, a linear regression cannot 
account for the sin part of the example function: 


In [7]: res = np.polyfit(x, f(x), deg=1, full=True)  # (1)

In [8]: res  # (2)
Out[8]: (array([ 4.28841952e-01, -1.31499950e-16]),
         array([21.03238686]),
         2,
         array([1., 1.]),
         1.1102230246251565e-14)

In [9]: ry = np.polyval(res[0], x)  # (3)

In [10]: create_plot([x, x], [f(x), ry], ['b', 'r.'],
                     ['f(x)', 'regression'], ['x', 'f(x)'])


(1) Linear regression step.
(2) Full results: regression parameters, residuals, effective rank, singular values, and
    relative condition number.
(3) Evaluation using the regression parameters.


Figure 11-2. Linear regression


To account for the sin part of the example function, higher-order monomials are 
necessary. The next regression attempt takes monomials up to the order of 5 as basis 
functions. It should not be too surprising that the regression result, as seen in 
Figure 11-3, now looks much closer to the original function. However, it is still far 
from being perfect: 


In [11]: reg = np.polyfit(x, f(x), deg=5) 
ry = np.polyval(reg, x) 


In [12]: create_plot([x, x], [f(x), ry], ['b', 'r.'], 
['f(x)', 'regression'], ['x', 'f(x)']) 


Figure 11-3. Regression with monomials up to order 5


The last attempt takes monomials up to order 7 to approximate the example func- 
tion. In this case the result, as presented in Figure 11-4, is quite convincing: 


In [13]: reg = np.polyfit(x, f(x), 7)
         ry = np.polyval(reg, x)

In [14]: np.allclose(f(x), ry)  # (1)
Out[14]: False

In [15]: np.mean((f(x) - ry) ** 2)  # (2)
Out[15]: 0.0017769134759517689

In [16]: create_plot([x, x], [f(x), ry], ['b', 'r.'],
                     ['f(x)', 'regression'], ['x', 'f(x)'])


(1) Checks whether the function and regression values are the same (or at least
    close).
(2) Calculates the Mean Squared Error (MSE) for the regression values given the
    function values.


Figure 11-4. Regression with monomials up to order 7
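The three regression attempts can be compared side by side by collecting the MSE for each degree. This is a small sketch that repeats the example function and the x values from above so that it runs on its own; the dictionary name `mse_by_degree` is an assumption:

```python
import numpy as np


def f(x):
    return np.sin(x) + 0.5 * x


x = np.linspace(-2 * np.pi, 2 * np.pi, 50)

mse_by_degree = {}
for deg in (1, 5, 7):
    coeffs = np.polyfit(x, f(x), deg=deg)
    ry = np.polyval(coeffs, x)
    # Mean squared error of the regression values for this degree.
    mse_by_degree[deg] = np.mean((f(x) - ry) ** 2)

for deg, mse in mse_by_degree.items():
    print(deg, mse)
```

The error shrinks with every added pair of monomials, but it does not reach zero, which motivates the choice of better basis functions.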


Individual basis functions 


In general, one can reach better regression results by choosing better sets of basis
functions, e.g., by exploiting knowledge about the function to approximate. In this
case, the individual basis functions have to be defined via a matrix approach (i.e.,
using a NumPy ndarray object). First, the case with monomials up to order 3
(Figure 11-5). The central function here is np.linalg.lstsq():


In [17]: matrix = np.zeros((3 + 1, len(x)))  # (1)
         matrix[3, :] = x ** 3  # (2)
         matrix[2, :] = x ** 2  # (2)
         matrix[1, :] = x  # (2)
         matrix[0, :] = 1  # (2)

In [18]: reg = np.linalg.lstsq(matrix.T, f(x), rcond=None)[0]  # (3)

In [19]: reg.round(4)  # (4)
Out[19]: array([ 0.    ,  0.5628, -0.    , -0.0054])

In [20]: ry = np.dot(reg, matrix)  # (5)

In [21]: create_plot([x, x], [f(x), ry], ['b', 'r.'],
                     ['f(x)', 'regression'], ['x', 'f(x)'])


(1) The ndarray object for the basis function values (matrix).
(2) The basis function values from constant to cubic.
(3) The regression step.
(4) The optimal regression parameters.
(5) The regression estimates for the function values.


Figure 11-5. Regression with individual basis functions


The result in Figure 11-5 is not as good as expected based on our previous experience 
with monomials. Using the more general approach allows us to exploit knowledge 
about the example function—namely that there is a sin part in the function. There- 
fore, it makes sense to include a sine function in the set of basis functions. For sim- 
plicity, the highest-order monomial is replaced. The fit now is perfect, as the numbers 
and Figure 11-6 illustrate: 


In [22]: matrix[3, :] = np.sin(x) (1)


In [23]: reg = np.linalg.lstsq(matrix.T, f(x), rcond=None)[0]


In [24]: reg.round(4) (2)
Out[24]: array([0. , 0.5, 0. , 1. ])


In [25]: ry = np.dot(reg, matrix)


In [26]: np.allclose(f(x), ry) (3)
Out[26]: True


In [27]: np.mean((f(x) - ry) ** 2) (3)
Out[27]: 3.404735992885531e-31


In [28]: create_plot([x, x], [f(x), ry], ['b', 'r.'],
                     ['f(x)', 'regression'], ['x', 'f(x)'])


(1) The new basis function exploiting knowledge about the example function.

(2) The optimal regression parameters recover the original parameters.

(3) The regression now leads to a perfect fit.

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Figure 11-6. Regression with the sine basis function 


Noisy data 


Regression can cope equally well with noisy data, be it data from simulation or from 
(nonperfect) measurements. To illustrate this point, independent observations with 
noise and dependent observations with noise are generated. Figure 11-7 reveals that 
the regression results are closer to the original function than the noisy data points. In 
a sense, the regression averages out the noise to some extent: 


In [29]: xn = np.linspace(-2 * np.pi, 2 * np.pi, 50) (1)
         xn = xn + 0.15 * np.random.standard_normal(len(xn)) (2)
         yn = f(xn) + 0.25 * np.random.standard_normal(len(xn)) (3)


In [30]: reg = np.polyfit(xn, yn, 7)
         ry = np.polyval(reg, xn)


In [31]: create_plot([x, x], [f(x), ry], ['b', 'r.'],
                     ['f(x)', 'regression'], ['x', 'f(x)'])


(1) The new deterministic x values.

(2) Introducing noise to the x values.

(3) Introducing noise to the y values.

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Figure 11-7. Regression for noisy data 


Unsorted data 


Another important aspect of regression is that the approach also works seamlessly 
with unsorted data. The previous examples all rely on sorted x data. This does not 
have to be the case. To make the point, let’s look at yet another randomization 
approach for the x values. In this case, one can hardly identify any structure by just 
visually inspecting the raw data: 


In [32]: xu = np.random.rand(50) * 4 * np.pi - 2 * np.pi (1)
         yu = f(xu)


In [33]: print(xu[:10].round(2))
         print(yu[:10].round(2))
         [-4.17 -0.11 -1.91  2.33  3.34 -0.96  5.81  4.92 -4.56 -5.42]
         [-1.23 -0.17 -1.9   1.89  1.47 -1.29  2.45  1.48 -1.29 -1.95]


In [34]: reg = np.polyfit(xu, yu, 5)
         ry = np.polyval(reg, xu)


In [35]: create_plot([xu, xu], [yu, ry], ['b.', 'ro'],
                     ['f(x)', 'regression'], ['x', 'f(x)'])


(1) Randomizes the x values.


As with the noisy data, the regression approach does not care about the order of the
observation points. This becomes obvious upon inspecting the structure of the
minimization problem in Equation 11-1: the sum of squared errors is the same for
any ordering of its terms. It is also obvious from the results, presented in
Figure 11-8.
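
The order independence can also be checked numerically. The following is a minimal sketch, not part of the original example: it fits the same randomly drawn points once in their original order and once sorted by x, and compares the coefficients. The seed and the use of np.random.default_rng() are assumptions made for reproducibility.

```python
import numpy as np


def f(x):
    # Example function used throughout this section.
    return np.sin(x) + 0.5 * x


rng = np.random.default_rng(0)  # seed chosen arbitrarily (assumption)
xu = rng.uniform(-2 * np.pi, 2 * np.pi, 50)  # unsorted x values
yu = f(xu)

order = np.argsort(xu)  # indices that sort the points by x
coef_unsorted = np.polyfit(xu, yu, 5)
coef_sorted = np.polyfit(xu[order], yu[order], 5)

# Reordering the observations only permutes the terms of the squared-error sum,
# so both fits should agree up to floating-point noise.
same = np.allclose(coef_unsorted, coef_sorted)
print(same)
```
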
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Figure 11-8. Regression for unsorted data 


Multiple dimensions 


Another convenient characteristic of the least-squares regression approach is that it 
carries over to multiple dimensions without too many modifications. As an example 
function take fm(), as presented next: 

In [36]: def fm(p):
             x, y = p
             return np.sin(x) + 0.25 * x + np.sqrt(y) + 0.05 * y ** 2

To properly visualize this function, grids (in two dimensions) of independent data 
points are needed. Based on such two-dimensional grids of independent and result- 
ing dependent data points, embodied in the following by X, Y, and Z, Figure 11-9 
presents the shape of the function fm(): 
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In [37]: x = np.linspace(0, 10, 20) 
         y = np.linspace(0, 10, 20)
         X, Y = np.meshgrid(x, y) (1)


In [38]: Z = fm((X, Y))
         x = X.flatten() (2)
         y = Y.flatten() (2)


In [39]: from mpl_toolkits.mplot3d import Axes3D (3)


In [40]: fig = plt.figure(figsize=(10, 6))
         ax = fig.gca(projection='3d')
         surf = ax.plot_surface(X, Y, Z, rstride=2, cstride=2,
                                cmap='coolwarm', linewidth=0.5,
                                antialiased=True)
         ax.set_xlabel('x')
         ax.set_ylabel('y')
         ax.set_zlabel('f(x, y)')
         fig.colorbar(surf, shrink=0.5, aspect=5)


(1) Generates 2D ndarray objects (“grids”) out of the 1D ndarray objects.

(2) Yields 1D ndarray objects from the 2D ndarray objects.

(3) Imports the 3D plotting capabilities from matplotlib as required.


Figure 11-9. The function with two parameters 
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To get good regression results, the set of basis functions is essential. Therefore, fac- 
toring in knowledge about the function fm() itself, both an np.sin() and an 
np.sqrt() function are included. Figure 11-10 shows the perfect regression results 
visually: 




In [41]: matrix = np.zeros((len(x), 6 + 1))
         matrix[:, 6] = np.sqrt(y) (1)
         matrix[:, 5] = np.sin(x) (2)
         matrix[:, 4] = y ** 2
         matrix[:, 3] = x ** 2
         matrix[:, 2] = y
         matrix[:, 1] = x
         matrix[:, 0] = 1


In [42]: reg = np.linalg.lstsq(matrix, fm((x, y)), rcond=None)[0]


In [43]: RZ = np.dot(matrix, reg).reshape((20, 20)) (3)


In [44]: fig = plt.figure(figsize=(10, 6))
         ax = fig.gca(projection='3d')
         surf1 = ax.plot_surface(X, Y, Z, rstride=2, cstride=2,
                                 cmap=mpl.cm.coolwarm, linewidth=0.5,
                                 antialiased=True) (4)
         surf2 = ax.plot_wireframe(X, Y, RZ, rstride=2, cstride=2,
                                   label='regression') (5)
         ax.set_xlabel('x')
         ax.set_ylabel('y')
         ax.set_zlabel('f(x, y)')
         ax.legend()
         fig.colorbar(surf, shrink=0.5, aspect=5)


(1) The np.sqrt() function for the y parameter.

(2) The np.sin() function for the x parameter.

(3) Transforms the regression results to the grid structure.

(4) Plots the original function surface.

(5) Plots the regression surface.
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Figure 11-10. Regression surface for function with two parameters 


Regression 


Least-squares regression approaches have multiple areas of appli- 
cation, including simple function approximation and function 
approximation based on noisy or unsorted data. These approaches 
can be applied to one-dimensional as well as multidimensional 
problems. Due to the underlying mathematics, the application is 
“almost always the same.” 
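
To illustrate why the application is “almost always the same,” the following sketch builds the regression from an explicit matrix of basis function values, with one column per basis function. It is a compact restatement of the one-dimensional sine example from this section; the names basis, coef, and fitted are illustrative choices and not taken from the original code.

```python
import numpy as np


def f(x):
    # Example function used throughout this section.
    return np.sin(x) + 0.5 * x


x = np.linspace(-2 * np.pi, 2 * np.pi, 50)

# One column per basis function: 1, x, x^2, sin(x).
basis = np.column_stack([np.ones_like(x), x, x ** 2, np.sin(x)])

# Least-squares solution of basis @ coef = f(x).
coef, *_ = np.linalg.lstsq(basis, f(x), rcond=None)
fitted = basis @ coef
mse = np.mean((f(x) - fitted) ** 2)

print(coef.round(4), mse)
```

Switching to another problem, including a multidimensional one, only changes the columns of the basis matrix; the call to np.linalg.lstsq() stays the same.
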


Interpolation 


Compared to regression, interpolation (e.g., with cubic splines) is more involved 
mathematically. It is also limited to low-dimensional problems. Given an ordered set 
of observation points (ordered in the x dimension), the basic idea is to do a regres- 
sion between two neighboring data points in such a way that not only are the data 
points perfectly matched by the resulting piecewise-defined interpolation function, 
but also the function is continuously differentiable at the data points. Continuous dif- 
ferentiability requires at least interpolation of degree 3—i.e., with cubic splines. How- 
ever, the approach also works in general with quadratic and even linear splines. 


The following code implements a linear splines interpolation, the result of which is
shown in Figure 11-11:

In [45]: import scipy.interpolate as spi (1)


In [46]: x = np.linspace(-2 * np.pi, 2 * np.pi, 25)


In [47]: def f(x):
             return np.sin(x) + 0.5 * x


In [48]: ipo = spi.splrep(x, f(x), k=1) (2)


In [49]: iy = spi.splev(x, ipo) (3)


In [50]: np.allclose(f(x), iy) (4)
Out[50]: True


In [51]: create_plot([x, x], [f(x), iy], ['b', 'ro'],
                     ['f(x)', 'interpolation'], ['x', 'f(x)'])


(1) Imports the required subpackage from SciPy.

(2) Implements a linear spline interpolation.

(3) Derives the interpolated values.

(4) Checks whether the interpolated values are close (enough) to the function values.



Figure 11-11. Linear splines interpolation (complete data set) 


The application itself, given an x-ordered set of data points, is as simple as the
application of np.polyfit() and np.polyval(). Here, the respective functions are
spi.splrep() and spi.splev(). Table 11-2 lists the major parameters that the
spi.splrep() function takes.


Table 11-2. Parameters of splrep() function 


Parameter     Description
x             (Ordered) x coordinates (independent variable values)
y             (x-ordered) y coordinates (dependent variable values)
w             Weights to apply to the y coordinates
xb, xe        Interval to fit; if None then [x[0], x[-1]]
k             Order of the spline fit (1 <= k <= 5)
s             Smoothing factor (the larger, the more smoothing)
full_output   If True, returns additional output
quiet         If True, suppresses messages


Table 11-3 lists the parameters that the spi.splev() function takes.


Table 11-3. Parameters of splev() function 


Parameter   Description
x           (Ordered) x coordinates (independent variable values)
tck         Sequence of length 3 returned by splrep() (knots, coefficients, degree)
der         Order of derivative (0 for function, 1 for first derivative)
ext         Behavior if x not in knot sequence (0 = extrapolate, 1 = return 0, 2 = raise ValueError)
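
The der and ext parameters are easiest to understand by example. The sketch below is an illustration added here, not code from the original text: it evaluates the first derivative of a cubic spline and compares it with the known derivative cos(x) + 0.5 of the example function, then evaluates the spline outside the knot interval with ext=1.

```python
import numpy as np
import scipy.interpolate as spi


def f(x):
    # Example function used throughout this section.
    return np.sin(x) + 0.5 * x


x = np.linspace(-2 * np.pi, 2 * np.pi, 25)
tck = spi.splrep(x, f(x), k=3)  # cubic spline representation

xd = np.linspace(1.0, 3.0, 50)
values = spi.splev(xd, tck)         # der=0 (default): function values
slopes = spi.splev(xd, tck, der=1)  # der=1: first derivative of the spline

# Compare with the exact derivative of the example function.
max_slope_error = np.max(np.abs(slopes - (np.cos(xd) + 0.5)))

# ext=1 returns 0 outside the knot interval instead of extrapolating.
outside = spi.splev(np.array([10.0]), tck, ext=1)

print(max_slope_error, outside)
```
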


Spline interpolation is often used in finance to generate estimates for dependent val- 
ues of independent data points not included in the original observations. To this end, 
the next example picks a much smaller interval and has a closer look at the interpola- 
ted values with the linear splines. Figure 11-12 reveals that the interpolation function 
indeed interpolates linearly between two observation points. For certain applications 
this might not be precise enough. In addition, it is evident that the function is not 
continuously differentiable at the original data points—another drawback: 


In [52]: xd = np.linspace(1.0, 3.0, 50) (1)
         iyd = spi.splev(xd, ipo)


In [53]: create_plot([xd, xd], [f(xd), iyd], ['b', 'ro'],
                     ['f(x)', 'interpolation'], ['x', 'f(x)'])


(1) Smaller interval with more points.
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Figure 11-12. Linear splines interpolation (data subset) 


A repetition of the complete exercise, this time using cubic splines, improves the 
results considerably (see Figure 11-13): 


In [54]: ipo = spi.splrep(x, f(x), k=3) (1)
         iyd = spi.splev(xd, ipo) (2)


In [55]: np.allclose(f(xd), iyd) (3)
Out[55]: False


In [56]: np.mean((f(xd) - iyd) ** 2) (4)
Out[56]: 1.1349319851436892e-08


In [57]: create_plot([xd, xd], [f(xd), iyd], ['b', 'ro'],
                     ['f(x)', 'interpolation'], ['x', 'f(x)'])


(1) Cubic splines interpolation on complete data sets.

(2) Results applied to the smaller interval.

(3) The interpolation is still not perfect ...

(4) ... but better than before.




Figure 11-13. Cubic splines interpolation (data subset) 


Interpolation 


In those cases where spline interpolation can be applied, one can 
expect better approximation results compared to a least-squares 
regression approach. However, remember that sorted (and “non- 
noisy”) data is required and that the approach is limited to low- 
dimensional problems. It is also computationally more demanding 
and might therefore take (much) longer than regression in certain 
use cases. 
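
The comparison between the two approaches can be made concrete. The following sketch is an added illustration under the same assumptions as the examples above (25 sorted, noise-free observations of the example function): it evaluates a degree-7 regression and linear and cubic splines on points that lie between the observations and reports the respective MSE values.

```python
import numpy as np
import scipy.interpolate as spi


def f(x):
    # Example function used throughout this section.
    return np.sin(x) + 0.5 * x


x = np.linspace(-2 * np.pi, 2 * np.pi, 25)  # observation points
xd = np.linspace(1.0, 3.0, 50)              # evaluation points in between

# Least-squares regression with monomials up to order 7.
reg = np.polyfit(x, f(x), 7)
mse_reg = np.mean((f(xd) - np.polyval(reg, xd)) ** 2)

# Spline interpolation with linear (k=1) and cubic (k=3) splines.
mse_spline = {}
for k in (1, 3):
    tck = spi.splrep(x, f(x), k=k)
    mse_spline[k] = np.mean((f(xd) - spi.splev(xd, tck)) ** 2)

print(mse_reg, mse_spline)
```
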


Convex Optimization 


In finance and economics, convex optimization plays an important role. Examples are 
the calibration of option pricing models to market data or the optimization of an 
agent’s utility function. As an example, take the function fm(): 


In [58]: def fm(p):
             x, y = p
             return (np.sin(x) + 0.05 * x ** 2
                   + np.sin(y) + 0.05 * y ** 2)

Figure 11-14 shows the function graphically for the defined intervals for x and y. Vis- 
ual inspection already reveals that this function has multiple local minima. The exis- 
tence of a global minimum cannot really be confirmed by this particular graphical 
representation, but it seems to exist: 
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In [59]: 


         x = np.linspace(-10, 10, 50)
         y = np.linspace(-10, 10, 50)
         X, Y = np.meshgrid(x, y)
         Z = fm((X, Y))


In [60]: fig = plt.figure(figsize=(10, 6))
         ax = fig.gca(projection='3d')
         surf = ax.plot_surface(X, Y, Z, rstride=2, cstride=2,
                                cmap='coolwarm', linewidth=0.5,
                                antialiased=True)
         ax.set_xlabel('x')
         ax.set_ylabel('y')
         ax.set_zlabel('f(x, y)')
         fig.colorbar(surf, shrink=0.5, aspect=5)

Figure 11-14. The function to be minimized with two parameters


Global Optimization 


In what follows, both a global minimization approach and a local one are imple- 
mented. The functions sco.brute() and sco.fmin() that are applied are from 
scipy.optimize. 


To have a closer look behind the scenes during minimization procedures, the follow- 
ing code amends the original function by an option to output current parameter val- 
ues as well as the function value. This allows us to keep track of all relevant 
information for the procedure: 
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In [61]: import scipy.optimize as sco (1) 


In [62]: def fo(p):
             x, y = p
             z = np.sin(x) + 0.05 * x ** 2 + np.sin(y) + 0.05 * y ** 2
             if output == True:
                 print('%8.4f | %8.4f | %8.4f' % (x, y, z)) (2)
             return z


In [63]: output = True
         sco.brute(fo, ((-10, 10.1, 5), (-10, 10.1, 5)), finish=None) (3)


         -10.0000 | -10.0000 |  11.0880
         -10.0000 | -10.0000 |  11.0880
         -10.0000 |  -5.0000 |   7.7529
         -10.0000 |   0.0000 |   5.5440
         -10.0000 |   5.0000 |   5.8351
         -10.0000 |  10.0000 |  10.0000
          -5.0000 | -10.0000 |   7.7529
          -5.0000 |  -5.0000 |   4.4178
          -5.0000 |   0.0000 |   2.2089
          -5.0000 |   5.0000 |   2.5000
          -5.0000 |  10.0000 |   6.6649
           0.0000 | -10.0000 |   5.5440
           0.0000 |  -5.0000 |   2.2089
           0.0000 |   0.0000 |   0.0000
           0.0000 |   5.0000 |   0.2911
           0.0000 |  10.0000 |   4.4560
           5.0000 | -10.0000 |   5.8351
           5.0000 |  -5.0000 |   2.5000
           5.0000 |   0.0000 |   0.2911
           5.0000 |   5.0000 |   0.5822
           5.0000 |  10.0000 |   4.7471
          10.0000 | -10.0000 |  10.0000
          10.0000 |  -5.0000 |   6.6649
          10.0000 |   0.0000 |   4.4560
          10.0000 |   5.0000 |   4.7471
          10.0000 |  10.0000 |   8.9120


Out[63]: array([0., 0.]) 


(1) Imports the required subpackage from SciPy.

(2) The information to print out if output = True.

(3) The brute force optimization.


The optimal parameter values, given the initial parameterization of the function, are 
x = y = 0. The resulting function value is also 0, as a quick review of the preceding 
output reveals. One might be inclined to accept this as the global minimum. How- 
ever, the first parameterization here is quite rough, in that step sizes of 5 for both 
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input parameters are used. This can of course be refined considerably, leading to bet- 
ter results in this case—and showing that the previous solution is not the optimal 
one: 


In [64]: output = False
         opt1 = sco.brute(fo, ((-10, 10.1, 0.1), (-10, 10.1, 0.1)), finish=None)


In [65]: opt1 
Out[65]: array([-1.4, -1.4]) 


In [66]: fm(opt1) 
Out[66]: -1.7748994599769203 


The optimal parameter values are now x = y = -1.4 and the minimal function value 
for the global minimization is about -1.7749. 


Local Optimization 


The local convex optimization that follows draws on the results from the global opti- 
mization. The function sco.fmin() takes as input the function to minimize and the 
starting parameter values. Optional parameter values are the input parameter toler- 
ance and function value tolerance, as well as the maximum number of iterations and 
function calls. The local optimization further improves the result: 

In [67]: output = True
         opt2 = sco.fmin(fo, opt1, xtol=0.001, ftol=0.001,
                         maxiter=15, maxfun=20) (1)
          -1.4000 |  -1.4000 |  -1.7749
          -1.4700 |  -1.4000 |  -1.7743
          -1.4000 |  -1.4700 |  -1.7743
          -1.3300 |  -1.4700 |  -1.7696
          -1.4350 |  -1.4175 |  -1.7756
          -1.4350 |  -1.3475 |  -1.7722
          -1.4088 |  -1.4394 |  -1.7755
          -1.4438 |  -1.4569 |  -1.7751
          -1.4328 |  -1.4427 |  -1.7756
          -1.4591 |  -1.4208 |  -1.7752
          -1.4213 |  -1.4347 |  -1.7757
          -1.4235 |  -1.4096 |  -1.7755
          -1.4305 |  -1.4344 |  -1.7757
          -1.4168 |  -1.4516 |  -1.7753
          -1.4305 |  -1.4260 |  -1.7757
          -1.4396 |  -1.4257 |  -1.7756
          -1.4259 |  -1.4325 |  -1.7757
          -1.4259 |  -1.4241 |  -1.7757
          -1.4304 |  -1.4177 |  -1.7757
          -1.4270 |  -1.4288 |  -1.7757


Warning: Maximum number of function evaluations has been exceeded. 


In [68]: opt2 
Out[68]: array([-1.42702972, -1.42876755]) 
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In [69]: fm(opt2) 
Out[69]: -1.7757246992239009 


(1) The local convex optimization.


For many convex optimization problems it is advisable to have a global minimization 
before the local one. The major reason for this is that local convex optimization algo- 
rithms can easily be trapped in a local minimum (or do “basin hopping”), ignoring 
completely better local minima and/or a global minimum. The following shows that 
setting the starting parameterization to x = y = 2 gives, for example, a “minimum” 
value of above zero: 

In [70]: output = False
         sco.fmin(fo, (2.0, 2.0), maxiter=250)
         Optimization terminated successfully.
                  Current function value: 0.015826
                  Iterations: 46
                  Function evaluations: 86

Out[70]: array([4.2710728 , 4.27106945])
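
The two-step procedure described above can be condensed into a few lines. The following is a minimal sketch that assumes the same example function and grid as before; the variable names start, refined, and trapped are illustrative. It first runs the grid search, then refines the best grid point locally, and for contrast repeats the local search from the poorly chosen starting point (2, 2).

```python
import numpy as np
import scipy.optimize as sco


def fm(p):
    # Example function with multiple local minima.
    x, y = p
    return np.sin(x) + 0.05 * x ** 2 + np.sin(y) + 0.05 * y ** 2


# Step 1: grid search over [-10, 10] x [-10, 10] with step size 0.1.
grid = ((-10, 10.1, 0.1), (-10, 10.1, 0.1))
start = sco.brute(fm, grid, finish=None)

# Step 2: local refinement starting from the best grid point.
refined = sco.fmin(fm, start, xtol=1e-6, ftol=1e-6, disp=False)

# For contrast: the same local search from a poorly chosen start.
trapped = sco.fmin(fm, (2.0, 2.0), disp=False)

print(refined, fm(refined))
print(trapped, fm(trapped))
```
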


Constrained Optimization 


So far, this section only considers unconstrained optimization problems. However, 
large classes of economic or financial optimization problems are constrained by one 
or multiple constraints. Such constraints can formally take on the form of equalities 
or inequalities. 


As a simple example, consider the utility maximization problem of an (expected
utility maximizing) investor who can invest in two risky securities. Both securities
cost $q_a = q_b = 10$ USD today. After one year, they have a payoff of 15 USD and
5 USD, respectively, in state u, and of 5 USD and 12 USD, respectively, in state d.
Both states are equally likely. Denote the vector payoffs for the two securities by
$r_a$ and $r_b$, respectively.


The investor has a budget of $w_0 = 100$ USD to invest and derives utility from future
wealth according to the utility function $u(w) = \sqrt{w}$, where w is the wealth (USD
amount) available. Equation 11-2 is a formulation of the maximization problem
where a, b are the numbers of securities bought by the investor.
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Equation 11-2. Expected utility maximization problem (1) 


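$$
\begin{aligned}
\max_{a,b} \; \mathbf{E}(u(w_1)) &= p \sqrt{w_{1u}} + (1 - p) \sqrt{w_{1d}} \\
w_1 &= a \cdot r_a + b \cdot r_b \\
w_0 &\geq a \cdot q_a + b \cdot q_b \\
a, b &\geq 0
\end{aligned}
$$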


Putting in all numerical assumptions, one gets the problem in Equation 11-3. Note 
the change to the minimization of the negative expected utility. 


Equation 11-3. Expected utility maximization problem (2) 


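$$
\begin{aligned}
\min_{a,b} \; -\mathbf{E}(u(w_1)) &= -\left(0.5 \cdot \sqrt{w_{1u}} + 0.5 \cdot \sqrt{w_{1d}}\right) \\
w_{1u} &= a \cdot 15 + b \cdot 5 \\
w_{1d} &= a \cdot 5 + b \cdot 12 \\
100 &\geq a \cdot 10 + b \cdot 10 \\
a, b &\geq 0
\end{aligned}
$$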


To solve this problem, the scipy.optimize.minimize() function is appropriate. This
function takes as input, in addition to the function to be minimized, conditions in
the form of equalities and inequalities (as a list of dict objects) as well as
boundaries for the parameters (as a tuple of tuple objects).¹ The following
translates the problem from Equation 11-3 into Python code:


In [71]: import math 


In [72]: def Eu(p): (1)
             s, b = p
             return -(0.5 * math.sqrt(s * 15 + b * 5) +
                      0.5 * math.sqrt(s * 5 + b * 12))


In [73]: cons = ({'type': 'ineq',
                  'fun': lambda p: 100 - p[0] * 10 - p[1] * 10}) (2)


In [74]: bnds = ((0, 1000), (0, 1000)) (3)


In [75]: result = sco.minimize(Eu, [5, 5], method='SLSQP',
                               bounds=bnds, constraints=cons) (4)


1 For details and examples of how to use the minimize function, refer to the documentation. 




(1) The function to be minimized, in order to maximize the expected utility.

(2) The inequality constraint as a dict object.

(3) The boundary values for the parameters (chosen to be wide enough).

(4) The constrained optimization.


The result object contains all the relevant information. With regard to the minimal
function value, one needs to remember to flip the sign back:


In [76]: result
Out[76]:      fun: -9.700883611487832
              jac: array([-0.48508096, -0.48489535])
          message: 'Optimization terminated successfully.'
             nfev: 21
              nit: 5
             njev: 5
           status: 0
          success: True
                x: array([8.02547122, 1.97452878])


In [77]: result['x'] (1)
Out[77]: array([8.02547122, 1.97452878])


In [78]: -result['fun'] (2)
Out[78]: 9.700883611487832


In [79]: np.dot(result['x'], [10, 10]) (3)
Out[79]: 99.99999999999999


(1) The optimal parameter values (i.e., the optimal portfolio).

(2) The negative minimum function value as the optimal solution value.

(3) The budget constraint is binding; all wealth is invested.
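
Because the budget constraint is binding, the result can be cross-checked with a one-dimensional problem. The sketch below is an added plausibility check, not the original approach: it re-runs the constrained optimization and then substitutes b = 10 - a, which follows from 10a + 10b = 100, and minimizes over a alone with sco.minimize_scalar(). The function name neg_expected_utility is an illustrative choice.

```python
import numpy as np
import scipy.optimize as sco


def neg_expected_utility(p):
    # Negative expected square-root utility of future wealth.
    a, b = p
    return -(0.5 * np.sqrt(a * 15 + b * 5) + 0.5 * np.sqrt(a * 5 + b * 12))


# Constrained problem as in the text: budget of 100 USD, prices of 10 USD each.
budget = {'type': 'ineq', 'fun': lambda p: 100 - 10 * p[0] - 10 * p[1]}
res = sco.minimize(neg_expected_utility, [5, 5], method='SLSQP',
                   bounds=((0, 1000), (0, 1000)), constraints=budget)

# With a binding budget, b = 10 - a and only a remains to be chosen.
res_1d = sco.minimize_scalar(
    lambda a: neg_expected_utility((a, 10 - a)),
    bounds=(0, 10), method='bounded')

print(res.x, res_1d.x, -res.fun)
```

Both approaches should agree on the number of units of the first security up to the tolerance of the solvers.
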


Integration 


Especially when it comes to valuation and option pricing, integration is an important 
mathematical tool. This stems from the fact that risk-neutral values of derivatives can 
be expressed in general as the discounted expectation of their payoff under the risk- 
neutral or martingale measure. The expectation in turn is a sum in the discrete case 
and an integral in the continuous case. The subpackage scipy.integrate provides
different functions for numerical integration. The example function is known from 
“Approximation” on page 312: 
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In [80]: import scipy.integrate as sci 


In [81]: def f(x): 
return np.sin(x) + 0.5 * x 


The integration interval shall be [0.5, 9.5], leading to the definite integral as in 
Equation 11-4. 


Equation 11-4. Integral of example function 
$$
\int_{0.5}^{9.5} f(x)\,dx = \int_{0.5}^{9.5} \left(\sin(x) + \frac{x}{2}\right) dx
$$


The following code defines the major Python objects to evaluate the integral: 


In [82]: x = np.linspace(0, 10)
         y = f(x)
         a = 0.5 (1)
         b = 9.5 (2)
         Ix = np.linspace(a, b) (3)
         Iy = f(Ix) (4)


(1) Left integration limit.

(2) Right integration limit.

(3) Integration interval values.

(4) Integration function values.

Figure 11-15 visualizes the integral value as the gray-shaded area under the function: 


In [83]: from matplotlib.patches import Polygon 


In [84]: fig, ax = plt.subplots(figsize=(10, 6)) 

         plt.plot(x, y, 'b', linewidth=2)
         plt.ylim(bottom=0)
         Ix = np.linspace(a, b)
         Iy = f(Ix)
         verts = [(a, 0)] + list(zip(Ix, Iy)) + [(b, 0)]
         poly = Polygon(verts, facecolor='0.7', edgecolor='0.5')
         ax.add_patch(poly)
         plt.text(0.75 * (a + b), 1.5, r"$\int_a^b f(x)dx$",
                  horizontalalignment='center', fontsize=20)
         plt.figtext(0.9, 0.075, '$x$')
         plt.figtext(0.075, 0.9, '$f(x)$')
         ax.set_xticks((a, b))
         ax.set_xticklabels(('$a$', '$b$'))
         ax.set_yticks([f(a), f(b)]);


2 See Chapter 7 for a more detailed discussion of this type of plot.




Figure 11-15. Integral value as shaded area 


Numerical Integration 


The scipy.integrate subpackage contains a selection of functions to numerically 
integrate a given mathematical function for upper and lower integration limits. 
Examples are sci.fixed_quad() for fixed Gaussian quadrature, sci.quad() for 
adaptive quadrature, and sci.romberg() for Romberg integration: 


In [85]: sci.fixed_quad(f, a, b)[0] 
Out[85]: 24.366995967084602 


In [86]: sci.quad(f, a, b)[0] 
Out[86]: 24.374754718086752 


In [87]: sci.romberg(f, a, b) 
Out[87]: 24.374754718086713 


There are also a number of integration functions that take as input list or ndarray 
objects with function values and input values, respectively. Examples in this regard 
are sci.trapz(), using the trapezoidal rule, and sci.simps(), implementing 
Simpson’s rule: 


In [88]: xi = np.linspace(0.5, 9.5, 25) 


In [89]: sci.trapz(f(xi), xi) 
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Out[89]: 24.352733271544516 


In [90]: sci.simps(f(xi), xi) 
Out[90]: 24.37496418455075 
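
Depending on the installed SciPy version, some of the function names used above may no longer be available; more recent releases provide the sample-based rules under the names trapezoid() and simpson(). The following sketch assumes such a recent SciPy version and compares the numerical results with the closed-form value obtained from the antiderivative 0.25 * x ** 2 - cos(x) of the example function.

```python
import numpy as np
import scipy.integrate as sci


def f(x):
    # Example function used throughout this section.
    return np.sin(x) + 0.5 * x


a, b = 0.5, 9.5

# Closed-form value from the antiderivative 0.25 * x ** 2 - cos(x).
exact = (0.25 * b ** 2 - np.cos(b)) - (0.25 * a ** 2 - np.cos(a))

quad_value = sci.quad(f, a, b)[0]  # adaptive quadrature

xi = np.linspace(a, b, 25)
trap_value = sci.trapezoid(f(xi), x=xi)  # trapezoidal rule
simp_value = sci.simpson(f(xi), x=xi)    # Simpson's rule

for name, value in [('quad', quad_value), ('trapezoid', trap_value),
                    ('simpson', simp_value)]:
    print(f'{name:10s} {value:.6f}  error {abs(value - exact):.2e}')
```
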


Integration by Simulation 


The valuation of options and derivatives by Monte Carlo simulation (see Chapter 12) 
rests on the insight that one can evaluate an integral by simulation. To this end, draw 
I random values of x between the integral limits and evaluate the integration function 
at every random value for x. Sum up all the function values and take the average to 
arrive at an average function value over the integration interval. Multiply this value 
by the length of the integration interval to derive an estimate for the integral value. 


The following code shows how the Monte Carlo estimated integral value converges— 
although not monotonically—to the real one when one increases the number of ran- 
dom draws. The estimator is already quite close for relatively small numbers of 
random draws: 


In [91]: for i in range(1, 20):
             np.random.seed(1000)
             x = np.random.random(i * 10) * (b - a) + a (1)
             print(np.mean(f(x)) * (b - a))
         24.804762279331463
         26.522918898332378
         26.265547519223976
         26.02770339943824
         24.99954181440844
         23.881810141621663
         23.527912274843253
         23.507857658961207
         23.67236746066989
         23.679410416062886
         24.424401707879305
         24.239005346819056
         24.115396924962802
         24.424191987566726
         23.924933080533783
         24.19484212027875
         24.117348378249833
         24.100690929662274
         23.76905109847816


(1) Number of random x values is increased with every iteration.
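
How quickly the estimator converges can be quantified through its standard error, which shrinks proportionally to one over the square root of the number of draws. The following sketch is an added illustration; the seed, the sample sizes, and the use of np.random.default_rng() are assumptions and differ from the code above.

```python
import numpy as np


def f(x):
    # Example function used throughout this section.
    return np.sin(x) + 0.5 * x


a, b = 0.5, 9.5
rng = np.random.default_rng(1000)  # seed chosen arbitrarily (assumption)

for n in (100, 10_000, 1_000_000):
    x = rng.uniform(a, b, n)  # n uniform draws on [a, b]
    values = f(x) * (b - a)   # scaled function values
    estimate = values.mean()  # Monte Carlo estimate of the integral
    std_error = values.std(ddof=1) / np.sqrt(n)  # standard error of the mean
    print(f'{n:>9d}  {estimate:.4f}  +/- {std_error:.4f}')
```

Increasing the number of draws by a factor of 100 reduces the standard error by roughly a factor of 10.
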


Symbolic Computation 


The previous sections are mainly concerned with numerical computation. This sec- 
tion now introduces symbolic computation, which can be applied beneficially in 
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many areas of finance. To this end, SymPy, a library specifically dedicated to symbolic 
computation, is generally used. 


Basics 
SymPy introduces new classes of objects. A fundamental class is the Symbol class: 
In [92]: import sympy as sy 


In [93]: x = sy.Symbol('x') (1)
         y = sy.Symbol('y') (1)


In [94]: type(x)
Out[94]: sympy.core.symbol.Symbol


In [95]: sy.sqrt(x) (2)
Out[95]: sqrt(x)


In [96]: 3 + sy.sqrt(x) - 4 ** 2 (3)
Out[96]: sqrt(x) - 13


In [97]: f = x ** 2 + 3 + 0.5 * x ** 2 + 3 / 2 (4)


In [98]: sy.simplify(f) (5)
Out[98]: 1.5*x**2 + 4.5


(1) Defines symbols to work with.

(2) Applies a function on a symbol.

(3) A numerical expression defined on a symbol.

(4) A function defined symbolically.

(5) The function expression simplified.


This already illustrates a major difference to regular Python code. Although x has no 
numerical value, the square root of x is nevertheless defined with SymPy since x is a 
Symbol object. In that sense, sy.sqrt(x) can be part of arbitrary mathematical 
expressions. Notice that SymPy in general automatically simplifies a given mathemati- 
cal expression. Similarly, one can define arbitrary functions using Symbol objects. 
They are not to be confused with Python functions. 
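
The distinction between a symbolic expression and a Python function can be made explicit. In the following sketch, which is an added illustration, the expression is first evaluated by substitution and then converted into an ordinary Python callable with sy.lambdify(); the names expr and f_num are illustrative.

```python
import sympy as sy

x = sy.Symbol('x')

# A SymPy expression: it cannot be called like a Python function.
expr = x ** 2 + 3 + 0.5 * x ** 2 + 3 / 2

# Evaluation by substitution returns a SymPy number.
value = expr.subs(x, 2.0)

# lambdify() turns the expression into a regular Python callable.
f_num = sy.lambdify(x, expr, 'math')

print(value, f_num(2.0))
```
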


SymPy provides three basic renderers for mathematical expressions: 


• LaTeX-based
• Unicode-based
• ASCII-based


When working, for example, solely in a Jupyter Notebook environment (HTML- 
based), LaTeX rendering is generally a good (i.e., visually appealing) choice. The code 
that follows sticks to the simplest option, ASCII, to illustrate that there is no manual 
typesetting involved: 


In [99]: sy.init_printing(pretty_print=False, use_unicode=False) 


In [100]: print(sy.pretty(f))
               2
          1.5*x  + 4.5


In [101]: print(sy.pretty(sy.sqrt(x) + 0.5))
            ___
          \/ x  + 0.5


This section cannot go into details, but SymPy also provides many other useful
mathematical functions, for example when it comes to numerically evaluating π. The
following example shows the first and final 40 characters of the string representation of
π up to the 400,000th digit. It also searches for a six-digit, day-first birthday, a
popular task in certain mathematics and IT circles:


In [102]: %time pi_str = str(sy.N(sy.pi, 400000)) (1)
          CPU times: user 400 ms, sys: 10.9 ms, total: 411 ms
          Wall time: 501 ms


In [103]: pi_str[:42] (2)
Out[103]: '3.1415926535897932384626433832795028841971'


In [104]: pi_str[-40:] (3)
Out[104]: '8245672736856312185020980470362464176198'


In [105]: %time pi_str.find('061072') (4)
          CPU times: user 115 µs, sys: 1e+03 ns, total: 116 µs
          Wall time: 120 µs


Out[105]: 80847


(1) Returns the string representation of the first 400,000 digits of π.

(2) Shows the first 40 digits ...

(3) ... and the final 40 digits.

(4) Searches for a birthday date in the string.
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Equations 


A strength of SymPy is solving equations, e.g., of the form $x^2 - 1 = 0$. In general, SymPy
presumes that one is looking for a solution to the equation obtained by equating the
given expression to zero. Therefore, equations like $x^2 - 1 = 3$ might have to be
reformulated to get the desired result. Of course, SymPy can cope with more complex
expressions, like $x^3 + 0.5 x^2 - 1 = 0$. Finally, it can also deal with problems involving
imaginary numbers, such as $x^2 + y^2 = 0$:


In [106]: sy.solve(x ** 2 - 1) 
Out[106]: [-1, 1] 


In [107]: sy.solve(x ** 2 - 1 - 3) 
Out[107]: [-2, 2] 


In [108]: sy.solve(x ** 3 + 0.5 * x ** 2 - 1)
Out[108]: [0.858094329496553, -0.679047164748276 - 0.839206763026694*I,
           -0.679047164748276 + 0.839206763026694*I]


In [109]: sy.solve(x ** 2 + y ** 2)
Out[109]: [{x: -I*y}, {x: I*y}]
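
As an alternative to reformulating an equation by hand, the equation can be stated explicitly with an sy.Eq object. The following sketch, added here for illustration, shows that both formulations yield the same roots for the equation with right-hand side 3.

```python
import sympy as sy

x = sy.Symbol('x')

# Two equivalent ways of stating x^2 - 1 = 3.
roots_expr = sy.solve(x ** 2 - 1 - 3, x)      # expression equated to zero
roots_eq = sy.solve(sy.Eq(x ** 2 - 1, 3), x)  # explicit equation object

print(roots_expr, roots_eq)
```
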


Integration and Differentiation 


Another strength of SymPy is integration and differentiation. The example that
follows revisits the example function used for numerical and simulation-based
integration and derives both a symbolically and a numerically exact solution. Symbol
objects for the integration limits are required to get started:


In [110]: a, b = sy.symbols('a b') (1)


In [111]: I = sy.Integral(sy.sin(x) + 0.5 * x, (x, a, b)) (2)


In [112]: print(sy.pretty(I)) (2)
            b
            /
           |
           |  (0.5*x + sin(x)) dx
           |
          /
          a


In [113]: int_func = sy.integrate(sy.sin(x) + 0.5 * x, x) (3)


In [114]: print(sy.pretty(int_func)) (3)
                2
          0.25*x  - cos(x)


In [115]: Fb = int_func.subs(x, 9.5).evalf() (4)
          Fa = int_func.subs(x, 0.5).evalf() (4)


In [116]: Fb - Fa (5)
Out[116]: 24.3747547180867


(1) The Symbol objects for the integral limits.

(2) The Integral object defined and pretty-printed.

(3) The antiderivative derived and pretty-printed.

(4) The values of the antiderivative at the limits, obtained via the .subs()
and .evalf() methods.

(5) The exact numerical value of the integral.

The integral can also be solved symbolically with the symbolic integration limits: 

In [117]: int_func_limits = sy.integrate(sy.sin(x) + 0.5 * x, (x, a, b)) (1)


In [118]: print(sy.pretty(int_func_limits)) (1)
                  2         2
          - 0.25*a  + 0.25*b  + cos(a) - cos(b)


In [119]: int_func_limits.subs({a : 0.5, b : 9.5}).evalf() (2)
Out[119]: 24.3747547180868


In [120]: sy.integrate(sy.sin(x) + 0.5 * x, (x, 0.5, 9.5)) (3)
Out[120]: 24.3747547180867


(1) Solving the integral symbolically.

(2) Solving the integral numerically, using a dict object during substitution.

(3) Solving the integral numerically in a single step.


Differentiation 


The derivative of the antiderivative yields in general the original function. Applying 
the sy.diff() function to the symbolic antiderivative illustrates this: 

In [121]: int_func.diff() 

Out[121]: 0.5*x + sin(x) 

As with the integration example, differentiation is now used to derive the exact solution of the convex minimization problem this chapter looked at earlier. To this end, the respective function is defined symbolically, partial derivatives are derived, and the roots are identified.

A necessary but not sufficient condition for a global minimum is that both partial derivatives are zero. There is no guarantee of a symbolic solution, though, since both algorithmic and (multiple) existence issues come into play. One can nevertheless solve the two first-order conditions numerically, providing “educated” guesses based on the global and local minimization efforts from before:


In [122]: f = (sy.sin(x) + 0.05 * x ** 2
             + sy.sin(y) + 0.05 * y ** 2)  (1)

In [123]: del_x = sy.diff(f, x)  (2)
          del_x  (2)
Out[123]: 0.1*x + cos(x)

In [124]: del_y = sy.diff(f, y)  (2)
          del_y  (2)
Out[124]: 0.1*y + cos(y)

In [125]: xo = sy.nsolve(del_x, -1.5)  (3)
          xo  (3)
Out[125]: -1.42755177876459

In [126]: yo = sy.nsolve(del_y, -1.5)  (3)
          yo  (3)
Out[126]: -1.42755177876459

In [127]: f.subs({x : xo, y : yo}).evalf()  (4)
Out[127]: -1.77572565314742

(1) The symbolic version of the function.
(2) The two partial derivatives derived and printed.
(3) Educated guesses for the roots and resulting optimal values.
(4) The global minimum function value.


Again, providing uneducated/arbitrary guesses might lead the algorithm to a different stationary point instead of the global minimum:

In [128]: xo = sy.nsolve(del_x, 1.5)  (1)
          xo
Out[128]: 1.74632928225285

In [129]: yo = sy.nsolve(del_y, 1.5)  (1)
          yo
Out[129]: 1.74632928225285

In [130]: f.subs({x : xo, y : yo}).evalf()  (2)
Out[130]: 2.27423381055640

(1) Uneducated guesses for the roots.
(2) The function value at this stationary point, which lies far above the global minimum.


This numerically illustrates that the first-order conditions are necessary but not 
sufficient. 
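
A second-order check makes the difference between the two roots visible. The sketch below solves the first-order condition 0.1 * x + cos(x) = 0 with a plain Newton iteration from the two starting values used above and then inspects the sign of the second derivative 0.1 - sin(x). The function newton() is a hypothetical helper written for this illustration; it is not meant to mirror what sy.nsolve() does internally.

```python
import math


def g(x):
    # First-order condition: d/dx [sin(x) + 0.05 * x ** 2]
    return 0.1 * x + math.cos(x)


def g_prime(x):
    # Second derivative of sin(x) + 0.05 * x ** 2
    return 0.1 - math.sin(x)


def newton(x, tol=1e-14, max_iter=50):
    # Plain Newton iteration on g (hypothetical helper)
    for _ in range(max_iter):
        step = g(x) / g_prime(x)
        x -= step
        if abs(step) < tol:
            break
    return x


root_a = newton(-1.5)        # educated guess
root_b = newton(1.5)         # arbitrary guess
curv_a = g_prime(root_a)     # curvature at the first root
curv_b = g_prime(root_b)     # curvature at the second root
print(root_a, curv_a)
print(root_b, curv_b)
```

Since the function is a sum of one term in x and one term in y, the sign of this second derivative in each coordinate is informative: it is positive at the first root and negative at the second one. The second root therefore satisfies the first-order condition without being a minimum, which is a concrete instance of "necessary but not sufficient".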


Symbolic Computations 


When doing (financial) mathematics with Python, SymPy and symbolic computations prove to be a valuable tool. Especially for interactive financial analytics, this can be a more efficient approach compared to nonsymbolic approaches.


Conclusion 


This chapter covers selected mathematical topics and tools important to finance. For 
example, the approximation of functions is important in many financial areas, like 
factor-based models, yield curve interpolation, and regression-based Monte Carlo 
valuation approaches for American options. Convex optimization techniques are also 
regularly needed in finance; for example, when calibrating parametric option pricing 
models to market quotes or implied volatilities of options. 


Numerical integration is central to, for example, the pricing of options and deriva- 
tives. Having derived the risk-neutral probability measure for a (set of) stochastic 
process(es), option pricing boils down to taking the expectation of the option’s payoff 
under the risk-neutral measure and discounting this value back to the present date. 
Chapter 12 covers the simulation of several types of stochastic processes under the 
risk-neutral measure. 


Finally, this chapter introduces symbolic computation with SymPy. For a number of 
mathematical operations, like integration, differentiation, or the solving of equations, 
symbolic computation can prove a useful and efficient tool. 


Further Resources 


For further information on the Python libraries used in this chapter, consult the following web resources:

• See the NumPy Reference for details on the NumPy functions used in this chapter.
• Visit the SciPy documentation on optimization and root finding for details on scipy.optimize.
• Integration with scipy.integrate is explained in “Integration and ODEs”.
• The SymPy website provides a wealth of examples and detailed documentation.

For a mathematical reference for the topics covered in this chapter, see:

• Brandimarte, Paolo (2006). Numerical Methods in Finance and Economics: A MATLAB-Based Introduction. 2nd ed., Hoboken, NJ: John Wiley & Sons.

CHAPTER 12
Stochastics 


Predictability is not how things will go, but how they can go. 
—Raheel Farooq 


Nowadays, stochastics is one of the most important mathematical and numerical disciplines in finance. In the beginning of the modern era of finance, mainly in the 1970s and 1980s, the major goal of financial research was to come up with closed-form solutions for, e.g., option prices given a specific financial model. The requirements have drastically changed in recent years in that not only is the correct valuation of single financial instruments important to participants in the financial markets, but also the consistent valuation of whole derivatives books, for example. Similarly, to come up with consistent risk measures across a whole financial institution, like value-at-risk and credit valuation adjustments, one needs to take into account the whole book of the institution and all its counterparties. Such daunting tasks can only be tackled by flexible and efficient numerical methods. Therefore, stochastics in general and Monte Carlo simulation in particular have risen to prominence in the financial field.


This chapter introduces the following topics from a Python perspective: 


“Random Numbers”
It all starts with pseudo-random numbers, which build the basis for all simulation efforts; although quasi-random numbers (e.g., based on Sobol sequences) have gained some popularity in finance, pseudo-random numbers still seem to be the benchmark.

“Simulation”
In finance, two simulation tasks are of particular importance: simulation of random variables and of stochastic processes.

“Valuation”
The two main disciplines when it comes to valuation are the valuation of derivatives with European exercise (at a specific date) and American exercise (over a specific time interval); there are also instruments with Bermudan exercise, or exercise at a finite set of specific dates.

“Risk Measures”
Simulation lends itself pretty well to the calculation of risk measures like value-at-risk, credit value-at-risk, and credit valuation adjustments.


Random Numbers 


Throughout this chapter, to generate random numbers,¹ the functions provided by the numpy.random subpackage are used:

In [1]: import math
        import numpy as np
        import numpy.random as npr  (1)
        from pylab import plt, mpl

In [2]: plt.style.use('seaborn')
        mpl.rcParams['font.family'] = 'serif'
        %matplotlib inline

(1) Imports the random number generation subpackage from NumPy.


For example, the rand() function returns random numbers from the half-open interval [0,1) in the shape provided as a parameter to the function. The return object is an ndarray object. Such numbers can be easily transformed to cover other intervals of the real line. For instance, if one wants to generate random numbers from the interval [a,b)=[5,10), one can transform the returned numbers from npr.rand() as in the next example; this also works in multiple dimensions due to NumPy broadcasting:


In [3]: npr.seed(100)  (1)
        np.set_printoptions(precision=4)  (1)

In [4]: npr.rand(10)  (2)
Out[4]: array([0.5434, 0.2784, 0.4245, 0.8448, 0.0047, 0.1216, 0.6707, 0.8259,
               0.1367, 0.5751])

In [5]: npr.rand(5, 5)  (3)
Out[5]: array([[0.8913, 0.2092, 0.1853, 0.1084, 0.2197],
               [0.9786, 0.8117, 0.1719, 0.8162, 0.2741],
               [0.4317, 0.94  , 0.8176, 0.3361, 0.1754],
               [0.3728, 0.0057, 0.2524, 0.7957, 0.0153],
               [0.5988, 0.6038, 0.1051, 0.3819, 0.0365]])

In [6]: a = 5.  (4)
        b = 10.  (5)
        npr.rand(10) * (b - a) + a  (6)
Out[6]: array([9.4521, 9.9046, 5.2997, 9.4527, 7.8845, 8.7124, 8.1509, 7.9092,
               5.1022, 6.0501])

In [7]: npr.rand(5, 5) * (b - a) + a  (7)
Out[7]: array([[7.7234, 8.8456, 6.2535, 6.4295, 9.262 ],
               [9.875 , 9.4243, 6.7975, 7.9948, 6.774 ],
               [6.701 , 5.8904, 6.1885, 5.2243, 7.5272],
               [6.8813, 7.964 , 8.1497, 5.713 , 9.6692],
               [9.7319, 8.0115, 6.9388, 6.8159, 6.0217]])

(1) Fixes the seed value for reproducibility and fixes the number of digits for printouts.
(2) Uniformly distributed random numbers as one-dimensional ndarray object.
(3) Uniformly distributed random numbers as two-dimensional ndarray object.
(4) Lower limit ...
(5) ... and upper limit ...
(6) ... for the transformation to another interval.
(7) The same transformation for two dimensions.

1 For simplicity, we will speak of random numbers knowing that all numbers used will be pseudo-random.
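
The interval transformation can also be written with NumPy's newer Generator interface. The following is a minimal sketch, assuming a NumPy version that provides np.random.default_rng(); the seed and the variable names scaled and direct are arbitrary choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(100)          # seed chosen arbitrarily
a, b = 5.0, 10.0

u = rng.random((5, 5))                    # uniform draws from [0.0, 1.0)
scaled = a + (b - a) * u                  # broadcasting maps them to [a, b)
direct = rng.uniform(a, b, size=(5, 5))   # same distribution in one call

print(scaled.min(), scaled.max())
```

The scaling relies on broadcasting only, so it works unchanged for arrays of any shape. The draws differ from those of npr.rand() for the same seed because the two interfaces use different underlying generators.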


Table 12-1 lists functions to generate simple random numbers. 


Table 12-1. Functions for simple random number generation 


Function          Parameters             Returns/result
rand              d0, d1, ..., dn        Random values in the given shape
randn             d0, d1, ..., dn        A sample (or samples) from the standard normal distribution
randint           low[, high, size]      Random integers from low (inclusive) to high (exclusive)
random_integers   low[, high, size]      Random integers between low and high, inclusive
random_sample     [size]                 Random floats in the half-open interval [0.0, 1.0)
random            [size]                 Random floats in the half-open interval [0.0, 1.0)
ranf              [size]                 Random floats in the half-open interval [0.0, 1.0)
sample            [size]                 Random floats in the half-open interval [0.0, 1.0)
choice            a[, size, replace, p]  Random sample from a given 1D array
bytes             length                 Random bytes

It is straightforward to visualize some random draws generated by selected functions from Table 12-1. Figure 12-1 shows the results graphically for two continuous distributions and two discrete ones:


In [8]: sample_size = 500
        rn1 = npr.rand(sample_size, 3)  (1)
        rn2 = npr.randint(0, 10, sample_size)  (2)
        rn3 = npr.sample(size=sample_size)  (1)
        a = [0, 25, 50, 75, 100]
        rn4 = npr.choice(a, size=sample_size)  (3)

In [9]: fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2,
                                                     figsize=(10, 8))
        ax1.hist(rn1, bins=25, stacked=True)
        ax1.set_title('rand')
        ax1.set_ylabel('frequency')
        ax2.hist(rn2, bins=25)
        ax2.set_title('randint')
        ax3.hist(rn3, bins=25)
        ax3.set_title('sample')
        ax3.set_ylabel('frequency')
        ax4.hist(rn4, bins=25)
        ax4.set_title('choice');

(1) Uniformly distributed random numbers.
(2) Random integers for a given interval.
(3) Randomly sampled values from a finite list object.

Figure 12-1. Histograms of simple random numbers
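
The interval conventions from Table 12-1 can be checked directly on the draws. The following short sketch uses a larger sample than the one plotted; the seed, the sample size, and the variable names ints, floats, levels, and picks are assumptions made for this illustration.

```python
import numpy.random as npr

npr.seed(100)                            # arbitrary seed
ints = npr.randint(0, 10, 1000)          # upper bound 10 is exclusive
floats = npr.sample(size=1000)           # half-open interval [0.0, 1.0)
levels = [0, 25, 50, 75, 100]
picks = npr.choice(levels, size=1000)    # draws with replacement by default

print(ints.min(), ints.max())
print(floats.min(), floats.max())
print(sorted(set(picks.tolist())))
```

The integer draws never reach the upper bound, the floats stay below 1.0, and choice() only returns elements of the list object passed to it.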


Table 12-2 lists functions for generating random numbers according to different dis- 
tributions. 


Table 12-2. Functions to generate random numbers according to different distribution laws 


Function               Parameters                      Returns/result
beta                   a, b[, size]                    Samples from a beta distribution over [0, 1]
binomial               n, p[, size]                    Samples from a binomial distribution
chisquare              df[, size]                      Samples from a chi-square distribution
dirichlet              alpha[, size]                   Samples from the Dirichlet distribution
exponential            [scale, size]                   Samples from the exponential distribution
f                      dfnum, dfden[, size]            Samples from an F distribution
gamma                  shape[, scale, size]            Samples from a gamma distribution
geometric              p[, size]                       Samples from the geometric distribution
gumbel                 [loc, scale, size]              Samples from a Gumbel distribution
hypergeometric         ngood, nbad, nsample[, size]    Samples from a hypergeometric distribution
laplace                [loc, scale, size]              Samples from the Laplace or double exponential distribution
logistic               [loc, scale, size]              Samples from a logistic distribution
lognormal              [mean, sigma, size]             Samples from a log-normal distribution
logseries              p[, size]                       Samples from a logarithmic series distribution
multinomial            n, pvals[, size]                Samples from a multinomial distribution
multivariate_normal    mean, cov[, size]               Samples from a multivariate normal distribution
negative_binomial      n, p[, size]                    Samples from a negative binomial distribution
noncentral_chisquare   df, nonc[, size]                Samples from a noncentral chi-square distribution
noncentral_f           dfnum, dfden, nonc[, size]      Samples from the noncentral F distribution
normal                 [loc, scale, size]              Samples from a normal (Gaussian) distribution
pareto                 a[, size]                       Samples from a Pareto II or Lomax distribution with the specified shape
poisson                [lam, size]                     Samples from a Poisson distribution
power                  a[, size]                       Samples in [0, 1] from a power distribution with positive exponent a - 1
rayleigh               [scale, size]                   Samples from a Rayleigh distribution
standard_cauchy        [size]                          Samples from standard Cauchy distribution with mode = 0
standard_exponential   [size]                          Samples from the standard exponential distribution
standard_gamma         shape[, size]                   Samples from a standard gamma distribution
standard_normal        [size]                          Samples from a standard normal distribution (mean=0, stdev=1)
standard_t             df[, size]                      Samples from a Student's t distribution with df degrees of freedom
triangular             left, mode, right[, size]       Samples from the triangular distribution over the interval [left, right]
uniform                [low, high, size]               Samples from a uniform distribution
vonmises               mu, kappa[, size]               Samples from a von Mises distribution
wald                   mean, scale[, size]             Samples from a Wald, or inverse Gaussian, distribution
weibull                a[, size]                       Samples from a Weibull distribution
zipf                   a[, size]                       Samples from a Zipf distribution


Although there is much criticism around the use of (standard) normal distributions in finance, they are an indispensable tool and still the most widely used type of distribution, in analytical as well as numerical applications. One reason is that many financial models directly rest in one way or another on a normal distribution or a log-normal distribution. Another reason is that many financial models that do not rest directly on a (log-)normal assumption can be discretized, and thereby approximated for simulation purposes, by the use of the normal distribution.


As an illustration, Figure 12-2 visualizes random draws from the following 
distributions: 


• Standard normal with mean of 0 and standard deviation of 1
• Normal with mean of 100 and standard deviation of 20
• Chi-square with 0.5 degrees of freedom
• Poisson with lambda of 1


Figure 12-2 shows the results for the three continuous distributions and the discrete one (Poisson). The Poisson distribution is used, for example, to simulate the arrival of (rare) external events, like a jump in the price of an instrument or an exogenous shock. Here is the code that generates it:


In [10]: sample_size = 500
         rn1 = npr.standard_normal(sample_size)  (1)
         rn2 = npr.normal(100, 20, sample_size)  (2)
         rn3 = npr.chisquare(df=0.5, size=sample_size)  (3)
         rn4 = npr.poisson(lam=1.0, size=sample_size)  (4)

In [11]: fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2,
                                                      figsize=(10, 8))
         ax1.hist(rn1, bins=25)
         ax1.set_title('standard normal')
         ax1.set_ylabel('frequency')
         ax2.hist(rn2, bins=25)
         ax2.set_title('normal(100, 20)')
         ax3.hist(rn3, bins=25)
         ax3.set_title('chi square')
         ax3.set_ylabel('frequency')
         ax4.hist(rn4, bins=25)
         ax4.set_title('Poisson');

(1) Standard normally distributed random numbers.
(2) Normally distributed random numbers.
(3) Chi-square distributed random numbers.
(4) Poisson distributed numbers.

Figure 12-2. Histograms of random samples for different distributions
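
Beyond visual inspection, the draws can be compared with the theoretical mean and variance of each distribution. The sketch below is one way to do this; the sample size of 200,000, the seed, and the dictionary names draws and theory are assumptions for this illustration and are not part of the original code.

```python
import numpy.random as npr

npr.seed(1000)      # arbitrary seed
n = 200000          # larger than the 500 draws used for the histograms

draws = {
    'standard normal': npr.standard_normal(n),
    'normal(100, 20)': npr.normal(100, 20, n),
    'chi square (df=0.5)': npr.chisquare(df=0.5, size=n),
    'Poisson (lam=1)': npr.poisson(lam=1.0, size=n),
}

# Theoretical (mean, variance) pairs for the four distributions
theory = {
    'standard normal': (0.0, 1.0),
    'normal(100, 20)': (100.0, 400.0),
    'chi square (df=0.5)': (0.5, 1.0),
    'Poisson (lam=1)': (1.0, 1.0),
}

for name, x in draws.items():
    print('%22s %9.3f %9.3f' % (name, x.mean(), x.var()))
```

With a sample of this size the empirical moments should lie close to the theoretical ones; the remaining differences are sampling error.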


NumPy and Random Numbers 


This section shows that NumPy is a powerful (even indispensable) tool when generating pseudo-random numbers in Python. The creation of small or large ndarray objects with such numbers is not only convenient but also performant.


Simulation 


Monte Carlo simulation (MCS) is among the most important numerical techniques in finance, if not the most important and widely used. This mainly stems from the fact that it is the most flexible numerical method when it comes to the evaluation of mathematical expressions (e.g., integrals), and specifically the valuation of financial derivatives. The flexibility comes at the cost of a relatively high computational burden, though, since often hundreds of thousands or even millions of complex computations have to be carried out to come up with a single value estimate.

Random Variables

Consider, for example, the Black-Scholes-Merton setup for option pricing. In their setup, the level of a stock index $S_T$ at a future date $T$ given a level $S_0$ as of today is given according to Equation 12-1.

Equation 12-1. Simulating future index level in Black-Scholes-Merton setup

$$S_T = S_0 \exp\left(\left(r - \frac{1}{2}\sigma^2\right) T + \sigma \sqrt{T} z\right)$$

The variables and parameters have the following meaning:

$S_T$
    Index level at date $T$
$r$
    Constant riskless short rate
$\sigma$
    Constant volatility (= standard deviation of returns) of $S$
$z$
    Standard normally distributed random variable

This financial model is parameterized and simulated as follows. The output of this 
simulation code is shown in Figure 12-3: 


In [12]: S0 = 100  (1)
         r = 0.05  (2)
         sigma = 0.25  (3)
         T = 2.0  (4)
         I = 10000  (5)
         ST1 = S0 * np.exp((r - 0.5 * sigma ** 2) * T +
                  sigma * math.sqrt(T) * npr.standard_normal(I))  (6)

In [13]: plt.figure(figsize=(10, 6))
         plt.hist(ST1, bins=50)
         plt.xlabel('index level')
         plt.ylabel('frequency');

(1) The initial index level.
(2) The constant riskless short rate.
(3) The constant volatility factor.
(4) The horizon in year fractions.
(5) The number of simulations.
(6) The simulation itself via a vectorized expression; the discretization scheme makes use of the npr.standard_normal() function.

Figure 12-3. Statically simulated geometric Brownian motion (via npr.standard_normal())
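
A simple plausibility check for such a simulation is the risk-neutral expectation of the index level, which equals the initial level compounded at the short rate. The sketch below repeats the static simulation with the parameters from the text and compares the Monte Carlo mean with this value. The larger number of draws, the seed, and the names mc_mean and theo_mean are assumptions made for this illustration.

```python
import math
import numpy as np
import numpy.random as npr

S0, r, sigma, T = 100, 0.05, 0.25, 2.0   # parameters from the text
I = 100000                               # more draws than in the text
npr.seed(1000)                           # arbitrary seed

z = npr.standard_normal(I)
ST = S0 * np.exp((r - 0.5 * sigma ** 2) * T + sigma * math.sqrt(T) * z)

mc_mean = ST.mean()                      # Monte Carlo estimate of E[S_T]
theo_mean = S0 * math.exp(r * T)         # risk-neutral expectation of S_T
print(mc_mean, theo_mean)
```

The two values should differ only by sampling error, which shrinks with the square root of the number of draws.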


Figure 12-3 suggests that the distribution of the random variable as defined in Equation 12-1 is log-normal. One could therefore also try to use the npr.lognormal() function to directly derive the values for the random variable. In that case, one has to provide the mean and the standard deviation of the underlying normal distribution (i.e., of the logarithm of the random variable) to the function:


In [14]: ST2 = S0 * npr.lognormal((r - 0.5 * sigma ** 2) * T,
                                 sigma * math.sqrt(T), size=I)  (1)

In [15]: plt.figure(figsize=(10, 6))
         plt.hist(ST2, bins=50)
         plt.xlabel('index level')
         plt.ylabel('frequency');

(1) The simulation via a vectorized expression; the discretization scheme makes use of the npr.lognormal() function.

The result is shown in Figure 12-4.

Figure 12-4. Statically simulated geometric Brownian motion (via npr.lognormal())


By visual inspection, Figures 12-3 and 12-4 indeed look pretty similar. This can be verified a bit more rigorously by comparing statistical moments of the resulting distributions. To compare the distributional characteristics of simulation results, the scipy.stats subpackage and the helper function print_statistics(), as defined here, prove useful:


In [16]: import scipy.stats as scs

In [17]: def print_statistics(a1, a2):
             ''' Prints selected statistics.

             Parameters
             ==========
             a1, a2: ndarray objects
                 results objects from simulation
             '''
             sta1 = scs.describe(a1)  (1)
             sta2 = scs.describe(a2)  (1)
             print('%14s %14s %14s' %
                 ('statistic', 'data set 1', 'data set 2'))
             print(45 * "-")
             print('%14s %14.3f %14.3f' % ('size', sta1[0], sta2[0]))
             print('%14s %14.3f %14.3f' % ('min', sta1[1][0], sta2[1][0]))
             print('%14s %14.3f %14.3f' % ('max', sta1[1][1], sta2[1][1]))
             print('%14s %14.3f %14.3f' % ('mean', sta1[2], sta2[2]))
             print('%14s %14.3f %14.3f' % ('std', np.sqrt(sta1[3]),
                                           np.sqrt(sta2[3])))
             print('%14s %14.3f %14.3f' % ('skew', sta1[4], sta2[4]))
             print('%14s %14.3f %14.3f' % ('kurtosis', sta1[5], sta2[5]))

In [18]: print_statistics(ST1, ST2)
              statistic     data set 1     data set 2
         ---------------------------------------------
                   size      10000.000      10000.000
                    min         32.327         28.230
                    max        414.825        409.110
                   mean        110.730        110.431
                    std         40.300         39.878
                   skew          1.122          1.115
               kurtosis          2.438          2.217

(1) The scs.describe() function gives back important statistics for a data set.


Obviously, the statistics of both simulation results are quite similar. The differences 
are mainly due to what is called the sampling error in simulation. Another error can 
also be introduced when discretely simulating continuous stochastic processes— 
namely the discretization error, which plays no role here due to the static nature of 
the simulation approach. 
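
The size of the sampling error can be judged against the theoretical moments of the log-normal distribution, which are available in closed form. The following sketch evaluates the standard formulas for mean, standard deviation, skewness, and excess kurtosis with the parameters used above; the variable names mu, s2, and w are assumptions for this illustration.

```python
import math

S0, r, sigma, T = 100, 0.05, 0.25, 2.0       # parameters from the text
mu = (r - 0.5 * sigma ** 2) * T              # mean of log(S_T / S0)
s2 = sigma ** 2 * T                          # variance of log(S_T / S0)
w = math.exp(s2)

mean = S0 * math.exp(mu + 0.5 * s2)
std = mean * math.sqrt(w - 1)
skew = (w + 2) * math.sqrt(w - 1)
kurt = w ** 4 + 2 * w ** 3 + 3 * w ** 2 - 6  # excess kurtosis

print('%.3f %.3f %.3f %.3f' % (mean, std, skew, kurt))
```

The theoretical values are a mean of about 110.5, a standard deviation of about 40.3, a skewness of about 1.14, and an excess kurtosis of about 2.41. The simulated statistics scatter around these numbers, with the higher moments being estimated less precisely than the mean.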


Stochastic Processes 


Roughly speaking, a stochastic process is a sequence of random variables. In that 
sense, one should expect something similar to a sequence of repeated simulations of a 
random variable when simulating a process. This is mainly true, apart from the fact 
that the draws are typically not independent but rather depend on the result(s) of the 
previous draw(s). In general, however, stochastic processes used in finance exhibit 
the Markov property, which mainly says that tomorrow’s value of the process only 
depends on today’s state of the process, and not any other more “historic” state or 
even the whole path history. The process then is also called memoryless. 


Geometric Brownian motion 


Consider now the Black-Scholes-Merton model in its dynamic form, as described by the stochastic differential equation (SDE) in Equation 12-2. Here, $Z_t$ is a standard Brownian motion. The SDE is called a geometric Brownian motion. The values of $S_t$ are log-normally distributed and the (marginal) returns $dS_t / S_t$ are normally distributed.

Equation 12-2. Stochastic differential equation in Black-Scholes-Merton setup

$$dS_t = r S_t\,dt + \sigma S_t\,dZ_t$$

The SDE in Equation 12-2 can be discretized exactly by an Euler scheme. Such a scheme is presented in Equation 12-3, with $\Delta t$ being the fixed discretization interval and $z_t$ being a standard normally distributed random variable.

Equation 12-3. Simulating index levels dynamically in Black-Scholes-Merton setup

$$S_t = S_{t - \Delta t} \exp\left(\left(r - \frac{1}{2}\sigma^2\right)\Delta t + \sigma \sqrt{\Delta t}\, z_t\right)$$


As before, translation into Python and NumPy code is straightforward. The resulting end values for the index level are log-normally distributed again, as Figure 12-5 illustrates. The first four moments are also quite close to those resulting from the static simulation approach:


In [19]: I = 10000  (1)
         M = 50  (2)
         dt = T / M  (3)
         S = np.zeros((M + 1, I))  (4)
         S[0] = S0  (5)
         for t in range(1, M + 1):
             S[t] = S[t - 1] * np.exp((r - 0.5 * sigma ** 2) * dt +
                     sigma * math.sqrt(dt) * npr.standard_normal(I))  (6)

In [20]: plt.figure(figsize=(10, 6))
         plt.hist(S[-1], bins=50)
         plt.xlabel('index level')
         plt.ylabel('frequency');

(1) The number of paths to be simulated.
(2) The number of time intervals for the discretization.
(3) The length of the time interval in year fractions.
(4) The two-dimensional ndarray object for the index levels.
(5) The initial values for the initial point in time t = 0.
(6) The simulation via semivectorized expression; the loop is over the points in time starting at t = 1 and ending at t = T.

Figure 12-5. Dynamically simulated geometric Brownian motion at maturity
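
Because the scheme in Equation 12-3 is multiplicative, the loop over time can also be replaced by a cumulative sum of log-increments. The following is a sketch of this fully vectorized variant under the same parameters; it is an alternative formulation, not the code used in the text, and the names incr, log_paths, and S_vec as well as the seed are assumptions.

```python
import math
import numpy as np
import numpy.random as npr

S0, r, sigma, T = 100, 0.05, 0.25, 2.0   # parameters from the text
I, M = 10000, 50
dt = T / M
npr.seed(1000)                           # arbitrary seed

# Log-increments for all time steps and all paths in one array
incr = ((r - 0.5 * sigma ** 2) * dt +
        sigma * math.sqrt(dt) * npr.standard_normal((M, I)))

# Cumulative sums give the log-paths; a row of zeros represents t = 0
log_paths = np.vstack([np.zeros((1, I)), np.cumsum(incr, axis=0)])
S_vec = S0 * np.exp(log_paths)           # shape (M + 1, I)

print(S_vec.shape, S_vec[-1].mean())
```

The approach trades the Python loop for one larger array of random numbers, which can be attractive for many time steps as long as the array fits into memory.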


Following is a comparison of the statistics resulting from the dynamic simulation as 
well as from the static simulation. Figure 12-6 shows the first 10 simulated paths: 


In [21]: print_statistics(S[-1], ST2)
              statistic     data set 1     data set 2
         ---------------------------------------------
                   size      10000.000      10000.000
                    min         27.746         28.230
                    max        382.096        409.110
                   mean        110.423        110.431
                    std         39.179         39.878
                   skew          1.069          1.115
               kurtosis          2.028          2.217

In [22]: plt.figure(figsize=(10, 6))
         plt.plot(S[:, :10], lw=1.5)
         plt.xlabel('time')
         plt.ylabel('index level');

Figure 12-6. Dynamically simulated geometric Brownian motion paths


Using the dynamic simulation approach not only allows us to visualize paths as displayed in Figure 12-6, but also to value options with American/Bermudan exercise or options whose payoff is path-dependent. One gets the full dynamic picture over time, so to speak.

Square-root diffusion 


Another important class of financial processes is mean-reverting processes, which are 
used to model short rates or volatility processes, for example. A popular and widely 
used model is the square-root diffusion, as proposed by Cox, Ingersoll, and Ross 
(1985). Equation 12-4 provides the respective SDE. 

Equation 12-4. Stochastic differential equation for square-root diffusion

$$dx_t = \kappa(\theta - x_t)\,dt + \sigma \sqrt{x_t}\,dZ_t$$

The variables and parameters have the following meaning:

$x_t$
    Process level at date $t$
$\kappa$
    Mean-reversion factor
$\theta$
    Long-term mean of the process
$\sigma$
    Constant volatility parameter
$Z_t$
    Standard Brownian motion

It is well known that the values of $x_t$ are chi-squared distributed. However, as stated before, many financial models can be discretized and approximated by using the normal distribution (i.e., a so-called Euler discretization scheme). While the Euler scheme is exact for the geometric Brownian motion, it is biased for the majority of other stochastic processes. Even if there is an exact scheme available (one for the square-root diffusion will be presented later), the use of an Euler scheme might be desirable for numerical and/or computational reasons. Defining $s = t - \Delta t$ and $x^+ = \max(x, 0)$, Equation 12-5 presents such an Euler scheme. This particular one is generally called full truncation in the literature (see Hilpisch (2015) for more details and other schemes).

Equation 12-5. Euler discretization for square-root diffusion

$$\tilde{x}_t = \tilde{x}_s + \kappa(\theta - \tilde{x}_s^+)\Delta t + \sigma \sqrt{\tilde{x}_s^+} \sqrt{\Delta t}\, z_t$$

$$x_t = \tilde{x}_t^+$$

The square-root diffusion has the convenient and realistic characteristic that the values of $x_t$ remain strictly positive. When discretizing it by an Euler scheme, negative values cannot be excluded. That is the reason why one always works with the positive version of the originally simulated process. In the simulation code, one therefore needs two ndarray objects instead of only one. Figure 12-7 shows the result of the simulation graphically as a histogram:


In [23]: x0 = 0.05  (1)
         kappa = 3.0  (2)
         theta = 0.02  (3)
         sigma = 0.1  (4)
         I = 10000
         M = 50
         dt = T / M

In [24]: def srd_euler():
             xh = np.zeros((M + 1, I))
             x = np.zeros_like(xh)
             xh[0] = x0
             x[0] = x0
             for t in range(1, M + 1):
                 xh[t] = (xh[t - 1] +
                          kappa * (theta - np.maximum(xh[t - 1], 0)) * dt +
                          sigma * np.sqrt(np.maximum(xh[t - 1], 0)) *
                          math.sqrt(dt) * npr.standard_normal(I))  (5)
             x = np.maximum(xh, 0)
             return x
         x1 = srd_euler()

In [25]: plt.figure(figsize=(10, 6))
         plt.hist(x1[-1], bins=50)
         plt.xlabel('value')
         plt.ylabel('frequency');

(1) The initial value (e.g., for a short rate).
(2) The mean reversion factor.
(3) The long-term mean value.
(4) The volatility factor.
(5) The simulation based on an Euler scheme.

Figure 12-7. Dynamically simulated square-root diffusion at maturity (Euler scheme)
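
The function above reads its parameters from global variables. A variant with explicit arguments is easier to reuse and to test. The following sketch implements the same full truncation scheme; the function name srd_euler_sketch(), its signature, and the seed are assumptions for this illustration, and the horizon T = 2.0 is taken over from the earlier parameterization.

```python
import math
import numpy as np
import numpy.random as npr


def srd_euler_sketch(x0, kappa, theta, sigma, T, M, I, seed=None):
    ''' Full truncation Euler scheme with explicit arguments
    (hypothetical variant of srd_euler() without global variables). '''
    if seed is not None:
        npr.seed(seed)
    dt = T / M
    xh = np.zeros((M + 1, I))
    xh[0] = x0
    for t in range(1, M + 1):
        xp = np.maximum(xh[t - 1], 0)   # positive part of previous value
        xh[t] = (xh[t - 1] + kappa * (theta - xp) * dt +
                 sigma * np.sqrt(xp) * math.sqrt(dt) *
                 npr.standard_normal(I))
    return np.maximum(xh, 0)            # positive version of the process


x = srd_euler_sketch(0.05, 3.0, 0.02, 0.1, T=2.0, M=50, I=10000, seed=100)
# Expected value of the continuous-time process at T
expected = 0.02 + (0.05 - 0.02) * math.exp(-3.0 * 2.0)
print(x[-1].mean(), expected)
```

The returned array contains no negative values by construction, and the mean at maturity should be close to the theoretical expectation, which is already almost equal to the long-term mean of 0.02.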


Figure 12-8 then shows the first 10 simulated paths, illustrating the resulting negative average drift (due to $x_0 > \theta$) and the convergence to $\theta = 0.02$:

In [26]: plt.figure(figsize=(10, 6))
         plt.plot(x1[:, :10], lw=1.5)
         plt.xlabel('time')
         plt.ylabel('index level');

Figure 12-8. Dynamically simulated square-root diffusion paths (Euler scheme)


Equation 12-6 presents the exact discretization scheme for the square-root diffusion based on the noncentral chi-square distribution $\chi'^{2}_{df}$ with

$$df = \frac{4\theta\kappa}{\sigma^2}$$

degrees of freedom and noncentrality parameter

$$nc = \frac{4\kappa e^{-\kappa \Delta t}}{\sigma^2\left(1 - e^{-\kappa \Delta t}\right)} x_s$$

Equation 12-6. Exact discretization for square-root diffusion

$$x_t = \frac{\sigma^2\left(1 - e^{-\kappa \Delta t}\right)}{4\kappa} \chi'^{2}_{df}\left(\frac{4\kappa e^{-\kappa \Delta t}}{\sigma^2\left(1 - e^{-\kappa \Delta t}\right)} x_s\right)$$

The Python implementation of this discretization scheme is a bit more involved but still quite concise. Figure 12-9 shows the output at maturity of the simulation with the exact scheme as a histogram:


In [27]: def srd_exact():
             x = np.zeros((M + 1, I))
             x[0] = x0
             for t in range(1, M + 1):
                 df = 4 * theta * kappa / sigma ** 2  (1)
                 c = (sigma ** 2 * (1 - np.exp(-kappa * dt))) / (4 * kappa)  (1)
                 nc = np.exp(-kappa * dt) / c * x[t - 1]  (1)
                 x[t] = c * npr.noncentral_chisquare(df, nc, size=I)  (1)
             return x
         x2 = srd_exact()

In [28]: plt.figure(figsize=(10, 6))
         plt.hist(x2[-1], bins=50)
         plt.xlabel('value')
         plt.ylabel('frequency');

(1) Exact discretization scheme, making use of npr.noncentral_chisquare().

Figure 12-9. Dynamically simulated square-root diffusion at maturity (exact scheme)
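
The quantities df, c, and nc can be sanity-checked without any simulation. A noncentral chi-square variable has mean df + nc, so one step of the exact scheme has conditional mean c * (df + nc), which should coincide with the known conditional mean of the square-root diffusion. The sketch below verifies this identity for a single step with the parameter values from the text; the variable names mean_exact_step and mean_closed_form are assumptions for this illustration.

```python
import math

x_s, kappa, theta, sigma = 0.05, 3.0, 0.02, 0.1   # values from the text
dt = 2.0 / 50                                     # T / M with T = 2.0, M = 50

df = 4 * theta * kappa / sigma ** 2
c = sigma ** 2 * (1 - math.exp(-kappa * dt)) / (4 * kappa)
nc = math.exp(-kappa * dt) / c * x_s

# Mean of one exact step: scaling factor times the mean df + nc
mean_exact_step = c * (df + nc)
# Conditional mean of the square-root diffusion over one time step
mean_closed_form = theta + (x_s - theta) * math.exp(-kappa * dt)

print(df, mean_exact_step, mean_closed_form)
```

Both expressions reduce to the same formula algebraically, so they should agree up to floating-point precision. Such a check helps to catch transcription errors in the somewhat involved constants of the exact scheme.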


Figure 12-10 presents as before the first 10 simulated paths, again displaying the negative average drift and the convergence to $\theta$:

In [29]: plt.figure(figsize=(10, 6))
         plt.plot(x2[:, :10], lw=1.5)
         plt.xlabel('time')
         plt.ylabel('index level');

Figure 12-10. Dynamically simulated square-root diffusion paths (exact scheme)


Comparing the main statistics from the different approaches reveals that the biased 
Euler scheme indeed performs quite well when it comes to the desired statistical 
properties: 


In [30]: print_statistics(x1[-1], x2[-1])
              statistic     data set 1     data set 2
         ---------------------------------------------
                   size      10000.000      10000.000
                    min          0.003          0.005
                    max          0.049          0.047
                   mean          0.020          0.020
                    std          0.006          0.006
                   skew          0.529          0.532
               kurtosis          0.289          0.273

In [31]: I = 250000
         %time x1 = srd_euler()
         CPU times: user 1.62 s, sys: 184 ms, total: 1.81 s
         Wall time: 1.08 s

In [32]: %time x2 = srd_exact()
         CPU times: user 3.29 s, sys: 39.8 ms, total: 3.33 s
         Wall time: 1.98 s

In [33]: print_statistics(x1[-1], x2[-1])
         x1 = 0.0; x2 = 0.0
              statistic     data set 1     data set 2
         ---------------------------------------------
                   size     250000.000     250000.000
                    min          0.002          0.003
                    max          0.071          0.055
                   mean          0.020          0.020
                    std          0.006          0.006
                   skew          0.563          0.579
               kurtosis          0.492          0.520

However, a major difference can be observed in terms of execution speed, since sampling from the noncentral chi-square distribution is more computationally demanding than from the standard normal distribution. The exact scheme takes roughly twice as much time for virtually the same results as with the Euler scheme.


Stochastic volatility 


One of the major simplifying assumptions of the Black-Scholes-Merton model is the 
constant volatility. However, volatility in general is neither constant nor deterministic 
—it is stochastic. Therefore, a major advancement with regard to financial modeling 
was achieved in the early 1990s with the introduction of so-called stochastic volatility 
models. One of the most popular models that fall into that category is that of Heston 
(1993), which is presented in Equation 12-7. 


Equation 12-7. Stochastic differential equations for Heston stochastic volatility model

$$dS_t = r S_t\,dt + \sqrt{v_t} S_t\,dZ_t^1$$

$$dv_t = \kappa_v(\theta_v - v_t)\,dt + \sigma_v \sqrt{v_t}\,dZ_t^2$$

$$dZ_t^1 dZ_t^2 = \rho$$

The meaning of the variables and parameters can now be inferred easily from the discussion of the geometric Brownian motion and the square-root diffusion. The parameter $\rho$ represents the instantaneous correlation between the two standard Brownian motions $Z_t^1$, $Z_t^2$. This allows us to account for a stylized fact called the leverage effect, which in essence states that volatility goes up in times of stress (declining markets) and goes down in times of a bull market (rising markets).


Consider the following parameterization of the model. To account for the correlation 
between the two stochastic processes, one needs to determine the Cholesky decom- 
position of the correlation matrix: 


In [34]: S0 = 100.
         r = 0.05
         v0 = 0.1  # (1)
         kappa = 3.0
         theta = 0.25
         sigma = 0.1
         rho = 0.6  # (2)
         T = 1.0

In [35]: corr_mat = np.zeros((2, 2))
         corr_mat[0, :] = [1.0, rho]
         corr_mat[1, :] = [rho, 1.0]
         cho_mat = np.linalg.cholesky(corr_mat)  # (3)

In [36]: cho_mat  # (3)
Out[36]: array([[1. , 0. ],
                [0.6, 0.8]])

(1) Initial (instantaneous) volatility value.

(2) Fixed correlation between the two Brownian motions.

(3) Cholesky decomposition and resulting matrix.
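To see why the Cholesky matrix does the job, the following minimal sketch applies it to independent standard normal draws and checks the empirical correlation of the result. The generator `rng` and the sample size are assumptions made for illustration only:

```python
import numpy as np

rho = 0.6
corr_mat = np.array([[1.0, rho],
                     [rho, 1.0]])
cho_mat = np.linalg.cholesky(corr_mat)  # lower triangular factor

rng = np.random.default_rng(100)
z = rng.standard_normal((2, 200_000))  # two independent sets of draws
x = cho_mat @ z  # row 0 unchanged, row 1 now correlated with row 0

emp_corr = np.corrcoef(x)[0, 1]
print(cho_mat)
print(round(emp_corr, 3))
```

The first row of the transformed set equals the first row of the input, while the second row is a weighted sum of both inputs, which is what introduces the correlation.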


Before the simulation of the stochastic processes starts, the whole set of random
numbers for both processes is generated, with set 0 used for the index process
and set 1 for the volatility process. For the volatility process modeled by a square-root
diffusion, the Euler scheme is chosen, taking into account the correlation via the
Cholesky matrix:


In [37]: M = 50
         I = 10000
         dt = T / M

In [38]: ran_num = npr.standard_normal((2, M + 1, I))  # (1)

In [39]: v = np.zeros_like(ran_num[0])
         vh = np.zeros_like(v)

In [40]: v[0] = v0
         vh[0] = v0

In [41]: for t in range(1, M + 1):
             ran = np.dot(cho_mat, ran_num[:, t, :])  # (2)
             vh[t] = (vh[t - 1] +
                      kappa * (theta - np.maximum(vh[t - 1], 0)) * dt +
                      sigma * np.sqrt(np.maximum(vh[t - 1], 0)) *
                      math.sqrt(dt) * ran[1])  # (3)

In [42]: v = np.maximum(vh, 0)

(1) Generates the three-dimensional random number data set.

(2) Picks out the relevant random number subset and transforms it via the Cholesky
matrix.

(3) Simulates the paths based on an Euler scheme.


The simulation of the index level process also takes into account the correlation and 
uses the (in this case) exact Euler scheme for the geometric Brownian motion. 
Figure 12-11 shows the simulation results at maturity as a histogram for both the 
index level process and the volatility process: 


In [43]: S = np.zeros_like(ran_num[0])
         S[0] = S0
         for t in range(1, M + 1):
             ran = np.dot(cho_mat, ran_num[:, t, :])
             S[t] = S[t - 1] * np.exp((r - 0.5 * v[t]) * dt +
                                      np.sqrt(v[t]) * ran[0] * np.sqrt(dt))

In [44]: fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 6))
         ax1.hist(S[-1], bins=50)
         ax1.set_xlabel('index level')
         ax1.set_ylabel('frequency')
         ax2.hist(v[-1], bins=50)
         ax2.set_xlabel('volatility');

Figure 12-11. Dynamically simulated stochastic volatility process at maturity


This illustrates another advantage of working with the Euler scheme for the
square-root diffusion: correlation is easily and consistently accounted for since one only
draws standard normally distributed random numbers. There is no simple way of achieving
the same with a mixed approach (i.e., using Euler for the index and the noncentral
chi-square-based exact approach for the volatility process).

An inspection of the first 10 simulated paths of each process (see Figure 12-12) shows
that the volatility process is drifting positively on average and that it, as expected,
converges to $\theta = 0.25$:


In [45]: print_statistics(S[-1], v[-1]) 


statistic data set 1 data set 2 
size 10000.000 10000.000
min 20.556 0.174
max 517.798 0.328
mean 107.843 0.243
std 51.341 0.020
skew 1S7? 0.124 
kurtosis 4.306 0.048 


In [46]: fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True,
                                        figsize=(10, 6))
         ax1.plot(S[:, :10], lw=1.5)
         ax1.set_ylabel('index level')
         ax2.plot(v[:, :10], lw=1.5)
         ax2.set_xlabel('time')
         ax2.set_ylabel('volatility');

Figure 12-12. Dynamically simulated stochastic volatility process paths


A brief look at the statistics for the maturity date for both data sets reveals a
pretty high maximum value for the index level process. In fact, this is much higher
than the values a geometric Brownian motion with constant volatility would typically
reach, ceteris paribus.


Jump diffusion


Stochastic volatility and the leverage effect are stylized (empirical) facts found in a 
number of markets. Another important stylized fact is the existence of jumps in asset 
prices and, for example, volatility. In 1976, Merton published his jump diffusion 
model, enhancing the Black-Scholes-Merton setup through a model component gen- 
erating jumps with log-normal distribution. The risk-neutral SDE is presented in 
Equation 12-8. 


Equation 12-8. Stochastic differential equation for Merton jump diffusion model

$$dS_t = (r - r_J) S_t \, dt + \sigma S_t \, dZ_t + J_t S_t \, dN_t$$


For completeness, here is an overview of the meaning of the variables and parameters:

$S_t$
    Index level at date $t$

$r$
    Constant riskless short rate

$r_J \equiv \lambda \cdot (e^{\mu_J + \delta^2 / 2} - 1)$
    Drift correction for jump to maintain risk neutrality

$\sigma$
    Constant volatility of $S$

$Z_t$
    Standard Brownian motion

$J_t$
    Jump at date $t$ with distribution
    $\log(1 + J_t) \approx \mathbf{N}(\log(1 + \mu_J) - \delta^2 / 2, \delta^2)$, with
    $\mathbf{N}$ as the cumulative distribution function of a standard normal random
    variable

$N_t$
    Poisson process with intensity $\lambda$

Equation 12-9 presents an Euler discretization for the jump diffusion where the $z_t^n$
are standard normally distributed and the $y_t$ are Poisson distributed with intensity
$\lambda$.


Equation 12-9. Euler discretization for Merton jump diffusion model

$$S_t = S_{t - \Delta t} \left( e^{(r - r_J - \sigma^2 / 2) \Delta t + \sigma \sqrt{\Delta t} z_t^1} + \left( e^{\mu_J + \delta z_t^2} - 1 \right) y_t \right)$$

Given the discretization scheme, consider the following numerical parameterization:


In [47]: S0 = 100.
         r = 0.05
         sigma = 0.2
         lamb = 0.75  # (1)
         mu = -0.6  # (2)
         delta = 0.25  # (3)
         rj = lamb * (math.exp(mu + 0.5 * delta ** 2) - 1)  # (4)

In [48]: T = 1.0
         M = 50
         I = 10000
         dt = T / M

(1) The jump intensity.

(2) The mean jump size.

(3) The jump volatility.

(4) The drift correction.
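The role of the drift correction can be checked numerically: with `rj` included, the discounted average of the simulated index levels at maturity should stay close to the initial level. The following is a minimal sketch under the parameterization above; the generator `rng`, the larger number of paths, and the step-by-step update without storing full paths are assumptions made to keep the example light:

```python
import math
import numpy as np

S0, r, sigma = 100.0, 0.05, 0.2
lamb, mu, delta = 0.75, -0.6, 0.25
rj = lamb * (math.exp(mu + 0.5 * delta ** 2) - 1)  # drift correction
T, M, I = 1.0, 50, 100_000
dt = T / M

rng = np.random.default_rng(100)
S = np.full(I, S0)
for _ in range(M):
    z1 = rng.standard_normal(I)      # diffusion shocks
    z2 = rng.standard_normal(I)      # jump size shocks
    y = rng.poisson(lamb * dt, I)    # number of jumps in the interval
    S = S * (np.exp((r - rj - 0.5 * sigma ** 2) * dt
                    + sigma * math.sqrt(dt) * z1)
             + (np.exp(mu + delta * z2) - 1) * y)
    S = np.maximum(S, 0)

disc_mean = math.exp(-r * T) * S.mean()
print(round(rj, 4), round(disc_mean, 2))
```

Setting `rj = 0` in the same sketch should pull the discounted mean visibly below `S0`, since the jumps have a negative mean.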


This time, three sets of random numbers are needed. Notice in Figure 12-13 the sec- 
ond peak (bimodal frequency distribution), which is due to the jumps: 


In [49]: S = np.zeros((M + 1, I))
         S[0] = S0
         sn1 = npr.standard_normal((M + 1, I))  # (1)
         sn2 = npr.standard_normal((M + 1, I))  # (1)
         poi = npr.poisson(lamb * dt, (M + 1, I))  # (2)
         for t in range(1, M + 1, 1):
             S[t] = S[t - 1] * (np.exp((r - rj - 0.5 * sigma ** 2) * dt +
                                sigma * math.sqrt(dt) * sn1[t]) +
                                (np.exp(mu + delta * sn2[t]) - 1) *
                                poi[t])  # (3)
             S[t] = np.maximum(S[t], 0)

In [50]: plt.figure(figsize=(10, 6))
         plt.hist(S[-1], bins=50)
         plt.xlabel('value')
         plt.ylabel('frequency');

(1) Standard normally distributed random numbers.

(2) Poisson distributed random numbers.

(3) Simulation based on the exact Euler scheme.

Figure 12-13. Dynamically simulated jump diffusion process at maturity


The negative jumps can also be spotted in the first 10 simulated index level paths, as 
presented in Figure 12-14: 


In [51]: plt.figure(figsize=(10, 6))
         plt.plot(S[:, :10], lw=1.5)
         plt.xlabel('time')
         plt.ylabel('index level');

Figure 12-14. Dynamically simulated jump diffusion process paths


Variance Reduction 


Because the Python functions used so far generate pseudo-random numbers and due 
to the varying sizes of the samples drawn, the resulting sets of numbers might not 
exhibit statistics close enough to the expected or desired ones. For example, one 
would expect a set of standard normally distributed random numbers to show a 
mean of 0 and a standard deviation of 1. Let us check what statistics different sets of 
random numbers exhibit. To achieve a realistic comparison, the seed value for the 
random number generator is fixed: 


In [52]: print('%15s %15s' % ('Mean', 'Std. Deviation'))
         print(31 * '-')
         for i in range(1, 31, 2):
             npr.seed(100)
             sn = npr.standard_normal(i ** 2 * 10000)
             print('%15.12f %15.12f' % (sn.mean(), sn.std()))
                    Mean  Std. Deviation
         -------------------------------
          0.001150944833  1.006296354600
          0.002841204001  0.995987967146
          0.001998082016  0.997701714233
          0.001322322067  0.997771186968
          0.000592711311  0.998388962646
          0.000339730751  0.998399891450
          0.000228109010  0.998657429396
          0.000295768719  0.998877333340
          0.000257107789  0.999284894532
         -0.000357870642  0.999456401088
         -0.000528443742  0.999617831131
         -0.000300171536  0.999445228838
         -0.000162924037  0.999516059328
          0.000135778889  0.999611052522
          0.000182006048  0.999619405229

In [53]: i ** 2 * 10000
Out[53]: 8410000

The results show that the statistics “somehow” get better the larger the number of
draws becomes.² But they still do not match the desired ones, even in our largest
sample with more than 8,000,000 random numbers.


Fortunately, there are easy-to-implement, generic variance reduction techniques
available to improve the matching of the first two moments of the (standard) normal
distribution. The first technique is to use antithetic variates. This approach simply
draws only half the desired number of random numbers, and adds the same set of
random numbers with the opposite sign afterward.³ For example, if the random number
generator (i.e., the respective Python function) draws 0.5, then another number with
value -0.5 is added to the set. By construction, the mean value of such a data set must
equal zero.


With NumPy this is concisely implemented by using the function np.concatenate(). 
The following repeats the exercise from before, this time using antithetic variates: 


In [54]: sn = npr.standard_normal(int(10000 / 2))
         sn = np.concatenate((sn, -sn))  # (1)

In [55]: np.shape(sn)  # (2)
Out[55]: (10000,)

In [56]: sn.mean()  # (3)
Out[56]: 2.842170943040401e-18

In [57]: print('%15s %15s' % ('Mean', 'Std. Deviation'))
         print(31 * "-")
         for i in range(1, 31, 2):
             npr.seed(1000)
             sn = npr.standard_normal(i ** 2 * int(10000 / 2))
             sn = np.concatenate((sn, -sn))
             print("%15.12f %15.12f" % (sn.mean(), sn.std()))
                    Mean  Std. Deviation
         -------------------------------
          0.000000000000  1.009653753942
         -0.000000000000  1.000413716783
          0.000000000000  1.002925061201
         -0.000000000000  1.000755212673
          0.000000000000  1.001636910076
         -0.000000000000  1.000726758438
         -0.000000000000  1.001621265149
          0.000000000000  1.001203722778
         -0.000000000000  1.000556669784
         -0.000000000000  1.000113464185
         -0.000000000000  0.999435175324
         -0.000000000000  0.999356961431
         -0.000000000000  0.999641436845
         -0.000000000000  0.999642768905
         -0.000000000000  0.999638303451

(1) This concatenates the two ndarray objects ...

(2) ... to arrive at the desired number of random numbers.

(3) The resulting mean value is zero (within standard floating-point arithmetic
errors).

2 The approach here is inspired by the Law of Large Numbers.

3 The described method works for symmetric median 0 random variables only, like
standard normally distributed random variables, which are almost exclusively used
throughout.


As is immediately apparent, this approach corrects the first moment perfectly, which
should not come as a surprise due to the very construction of the data set. However,
this approach does not have any influence on the second moment, the standard
deviation. Using another variance reduction technique, called moment matching, helps
correct in one step both the first and second moments:


In [58]: sn = npr.standard_normal(10000)

In [59]: sn.mean()
Out[59]: -0.001165998295162494

In [60]: sn.std()
Out[60]: 0.991255920204605

In [61]: sn_new = (sn - sn.mean()) / sn.std()  # (1)

In [62]: sn_new.mean()
Out[62]: -2.3803181647963357e-17

In [63]: sn_new.std()
Out[63]: 0.9999999999999999

(1) Corrects both the first and second moment in a single step.

By subtracting the mean from every single random number and dividing every single
number by the standard deviation, this technique ensures that the set of random
numbers matches the desired first and second moments of the standard normal
distribution (almost) perfectly.

The following function utilizes the insight with regard to variance reduction
techniques and generates standard normal random numbers for process simulation using
either two, one, or no variance reduction technique(s):


In [64]: def gen_sn(M, I, anti_paths=True, mo_match=True):
             ''' Function to generate random numbers for simulation.

             Parameters

             M: int
                 number of time intervals for discretization
             I: int
                 number of paths to be simulated
             anti_paths: boolean
                 use of antithetic variates
             mo_match: boolean
                 use of moment matching
             '''
             if anti_paths is True:
                 sn = npr.standard_normal((M + 1, int(I / 2)))
                 sn = np.concatenate((sn, -sn), axis=1)
             else:
                 sn = npr.standard_normal((M + 1, I))
             if mo_match is True:
                 sn = (sn - sn.mean()) / sn.std()
             return sn
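A short usage check may help to see what the function guarantees. The sketch below repeats the function so that it runs on its own and then inspects the moments of the result; the seed and the sizes are assumptions chosen for illustration:

```python
import numpy as np
import numpy.random as npr


def gen_sn(M, I, anti_paths=True, mo_match=True):
    ''' Generates standard normal random numbers for simulation. '''
    if anti_paths is True:
        sn = npr.standard_normal((M + 1, int(I / 2)))
        sn = np.concatenate((sn, -sn), axis=1)
    else:
        sn = npr.standard_normal((M + 1, I))
    if mo_match is True:
        sn = (sn - sn.mean()) / sn.std()
    return sn


npr.seed(100)
sn = gen_sn(50, 10000)  # both techniques switched on
raw = gen_sn(50, 10000, anti_paths=False, mo_match=False)  # plain draws

print(sn.shape)
print(sn.mean(), sn.std())      # moments over the whole array
print(sn[1].mean(), sn[1].std())  # moments of a single time step
print(raw.mean(), raw.std())
```

Note that moment matching is applied to the array as a whole. The standard deviation of a single time step (one row) is therefore only approximately 1, while the row means are zero because of the antithetic construction along `axis=1`.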


Vectorization and Simulation 


Vectorization with NumPy is a natural, concise, and efficient 
approach to implementing Monte Carlo simulation algorithms in 
Python. However, using NumPy vectorization comes with a larger 
memory footprint in general. For alternatives that might be equally 
fast, see Chapter 10. 


Valuation 


One of the most important applications of Monte Carlo simulation is the valuation of 
contingent claims (options, derivatives, hybrid instruments, etc.). Simply stated, in a 
risk-neutral world, the value of a contingent claim is the discounted expected payoff 
under the risk-neutral (martingale) measure. This is the probability measure that 
makes all risk factors (stocks, indices, etc.) drift at the riskless short rate, making the 
discounted processes martingales. According to the Fundamental Theorem of Asset 
Pricing, the existence of such a probability measure is equivalent to the absence of 
arbitrage. 


A financial option embodies the right to buy (call option) or sell (put option) a
specified financial instrument at a given maturity date (European option), or over a
specified period of time (American option), at a given price (strike price). Let us first
consider the simpler case of European options in terms of valuation.


European Options 


The payoff of a European call option on an index at maturity is given by
$h(S_T) = \max(S_T - K, 0)$, where $S_T$ is the index level at maturity date $T$ and
$K$ is the strike price. Given a, or in complete markets the, risk-neutral measure for
the relevant stochastic process (e.g., geometric Brownian motion), the price of such an
option is given by the formula in Equation 12-10.


Equation 12-10. Pricing by risk-neutral expectation

$$C_0 = e^{-rT} \mathbf{E}_0^Q(h(S_T)) = e^{-rT} \int_0^\infty h(s) q(s) \, ds$$


Chapter 11 sketches how to numerically evaluate an integral by Monte Carlo
simulation. This approach is used in the following and applied to Equation 12-10. Equation
12-11 provides the respective Monte Carlo estimator for the European option, where
$\tilde{S}_T^i$ is the $i$th simulated index level at maturity.


Equation 12-11. Risk-neutral Monte Carlo estimator

$$\tilde{C}_0 = e^{-rT} \frac{1}{I} \sum_{i=1}^{I} h(\tilde{S}_T^i)$$


Consider now the following parameterization for the geometric Brownian motion 
and the valuation function gbm_mcs_stat(), taking as a parameter only the strike 
price. Here, only the index level at maturity is simulated. As a reference, consider the 
case with a strike price of K = 105: 


In [65]: S0 = 100.
         r = 0.05
         sigma = 0.25
         T = 1.0
         I = 50000

In [66]: def gbm_mcs_stat(K):
             ''' Valuation of European call option in Black-Scholes-Merton
             by Monte Carlo simulation (of index level at maturity)

             Parameters

             K: float
                 (positive) strike price of the option

             Returns

             C0: float
                 estimated present value of European call option
             '''
             sn = gen_sn(1, I)
             # simulate index level at maturity
             ST = S0 * np.exp((r - 0.5 * sigma ** 2) * T
                              + sigma * math.sqrt(T) * sn[1])
             # calculate payoff at maturity
             hT = np.maximum(ST - K, 0)
             # calculate MCS estimator
             C0 = math.exp(-r * T) * np.mean(hT)
             return C0

In [67]: gbm_mcs_stat(K=105.)  # (1)
Out[67]: 10.044221852841922

(1) The Monte Carlo estimator value for the European call option.
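A single number from a Monte Carlo estimator is easier to judge together with its sampling error. The following minimal sketch recomputes the static estimator with plain (not variance-reduced) draws and adds a standard error and an approximate 95% interval. The generator `rng` and the names `se` and `ci` are assumptions introduced for illustration; they are not part of `gbm_mcs_stat()`:

```python
import math
import numpy as np

S0, r, sigma, T, K = 100.0, 0.05, 0.25, 1.0, 105.0
I = 50000

rng = np.random.default_rng(100)
z = rng.standard_normal(I)
# index level at maturity under geometric Brownian motion
ST = S0 * np.exp((r - 0.5 * sigma ** 2) * T + sigma * math.sqrt(T) * z)
# discounted payoff per path
disc_payoff = math.exp(-r * T) * np.maximum(ST - K, 0)

C0 = disc_payoff.mean()                        # Monte Carlo estimator
se = disc_payoff.std(ddof=1) / math.sqrt(I)    # standard error of the mean
ci = (C0 - 1.96 * se, C0 + 1.96 * se)          # approximate 95% interval

print(round(C0, 3), round(se, 3))
print(tuple(round(b, 3) for b in ci))
```

Because the standard error shrinks with the square root of the number of paths, quadrupling `I` should roughly halve the width of the interval.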


Next, consider the dynamic simulation approach and allow for European put options
in addition to the call option. The function gbm_mcs_dyna() implements the
algorithm. The code also compares option price estimates for a call and a put struck at
the same level:


In [68]: M = 50  # (1)

In [69]: def gbm_mcs_dyna(K, option='call'):
             ''' Valuation of European options in Black-Scholes-Merton
             by Monte Carlo simulation (of index level paths)

             Parameters

             K: float
                 (positive) strike price of the option
             option: string
                 type of the option to be valued ('call', 'put')

             Returns

             C0: float
                 estimated present value of European option
             '''
             dt = T / M
             # simulation of index level paths
             S = np.zeros((M + 1, I))
             S[0] = S0
             sn = gen_sn(M, I)
             for t in range(1, M + 1):
                 S[t] = S[t - 1] * np.exp((r - 0.5 * sigma ** 2) * dt
                                          + sigma * math.sqrt(dt) * sn[t])
             # case-based calculation of payoff
             if option == 'call':
                 hT = np.maximum(S[-1] - K, 0)
             else:
                 hT = np.maximum(K - S[-1], 0)
             # calculation of MCS estimator
             C0 = math.exp(-r * T) * np.mean(hT)
             return C0

In [70]: gbm_mcs_dyna(K=110., option='call')  # (2)
Out[70]: 7.950008525028434

In [71]: gbm_mcs_dyna(K=110., option='put')  # (3)
Out[71]: 12.629934942682004

(1) The number of time intervals for the discretization.

(2) The Monte Carlo estimator value for the European call option.

(3) The Monte Carlo estimator value for the European put option.
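Call and put estimates for the same strike can be sanity-checked against put-call parity, C0 - P0 = S0 - K * exp(-r * T), which holds for European options on an asset without dividends. The following is a minimal sketch using a static simulation with plain draws; the generator `rng` and the variable names are assumptions made for illustration:

```python
import math
import numpy as np

S0, r, sigma, T, K = 100.0, 0.05, 0.25, 1.0, 110.0
I = 50000

rng = np.random.default_rng(100)
z = rng.standard_normal(I)
ST = S0 * np.exp((r - 0.5 * sigma ** 2) * T + sigma * math.sqrt(T) * z)

# call and put estimators based on the same simulated index levels
call = math.exp(-r * T) * np.mean(np.maximum(ST - K, 0))
put = math.exp(-r * T) * np.mean(np.maximum(K - ST, 0))

lhs = call - put                   # simulated call minus put
rhs = S0 - K * math.exp(-r * T)    # parity value, about -4.635

print(round(lhs, 3), round(rhs, 3))
```

The two estimates reported above give 7.950 - 12.630 = -4.680, which is close to the parity value of about -4.635; the remaining gap is sampling error.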


The question is how well these simulation-based valuation approaches perform rela- 
tive to the benchmark value from the Black-Scholes-Merton valuation formula. To 
find out, the following code generates respective option values/estimates for a range 
of strike prices, using the analytical option pricing formula for European calls found 
in the module bsm_functions.py (see “Python Script” on page 392). 
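The module itself is not reproduced in this section. As a stand-in, the following minimal sketch implements the standard Black-Scholes-Merton formula for a European call with the same argument order as the call `bsm_call_value(S0, K, T, r, sigma)` used below. The function names `norm_cdf()` and `bsm_call_value_sketch()` are hypothetical and the code is not the module from the book:

```python
from math import log, sqrt, exp, erf


def norm_cdf(x):
    ''' Cumulative distribution function of the standard normal. '''
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def bsm_call_value_sketch(S0, K, T, r, sigma):
    ''' European call value in the Black-Scholes-Merton model
    (hypothetical stand-in, no dividends assumed). '''
    d1 = (log(S0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S0 * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)


value = bsm_call_value_sketch(100.0, 105.0, 1.0, 0.05, 0.25)
print(round(value, 2))
```

For the reference case with a strike of 105, the formula gives a value of about 10.0, against which the static Monte Carlo estimate of 10.044 from above can be compared.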


First, we compare the results from the static simulation approach with precise
analytical values:

In [72]: from bsm_functions import bsm_call_value

In [73]: stat_res = []  # (1)
         dyna_res = []  # (1)
         anal_res = []  # (1)
         k_list = np.arange(80., 120.1, 5.)  # (2)
         np.random.seed(100)

In [74]: for K in k_list:
             stat_res.append(gbm_mcs_stat(K))  # (3)
             dyna_res.append(gbm_mcs_dyna(K))  # (3)
             anal_res.append(bsm_call_value(S0, K, T, r, sigma))  # (3)

In [75]: stat_res = np.array(stat_res)  # (4)
         dyna_res = np.array(dyna_res)  # (4)
         anal_res = np.array(anal_res)  # (4)

(1) Instantiates empty list objects to collect the results.

(2) Creates an ndarray object containing the range of strike prices.

(3) Simulates/calculates and collects the option values for all strike prices.

(4) Transforms the list objects to ndarray objects.


Figure 12-15 shows the results. All valuation differences are smaller than 1% in
absolute terms. There are both negative and positive value differences:


In [76]: plt.figure(figsize=(10, 6))
         fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(10, 6))
         ax1.plot(k_list, anal_res, 'b', label='analytical')
         ax1.plot(k_list, stat_res, 'ro', label='static')
         ax1.set_ylabel('European call option value')
         ax1.legend(loc=0)
         ax1.set_ylim(bottom=0)
         wi = 1.0
         ax2.bar(k_list - wi / 2, (anal_res - stat_res) / anal_res * 100, wi)
         ax2.set_xlabel('strike')
         ax2.set_ylabel('difference in %')
         ax2.set_xlim(left=75, right=125);
Out[76]: <Figure size 720x432 with 0 Axes>

Figure 12-15. Analytical option values vs. Monte Carlo estimators (static simulation)


A similar picture emerges for the dynamic simulation and valuation approach, whose
results are reported in Figure 12-16. Again, all valuation differences are smaller than
1% in absolute terms, with both positive and negative deviations. As a general rule, the
quality of the Monte Carlo estimator can be controlled for by adjusting the number of
time intervals M used and/or the number of paths I simulated:


In [77]: fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(10, 6))
         ax1.plot(k_list, anal_res, 'b', label='analytical')
         ax1.plot(k_list, dyna_res, 'ro', label='dynamic')
         ax1.set_ylabel('European call option value')
         ax1.legend(loc=0)
         ax1.set_ylim(bottom=0)
         wi = 1.0
         ax2.bar(k_list - wi / 2, (anal_res - dyna_res) / anal_res * 100, wi)
         ax2.set_xlabel('strike')
         ax2.set_ylabel('difference in %')
         ax2.set_xlim(left=75, right=125);

Figure 12-16. Analytical option values vs. Monte Carlo estimators (dynamic simulation)


American Options 


The valuation of American options is more involved compared to European options. 
In this case, an optimal stopping problem has to be solved to come up with a fair value 
of the option. Equation 12-12 formulates the valuation of an American option as 
such a problem. The problem formulation is already based on a discrete time grid for 
use with numerical simulation. In a sense, it is therefore more correct to speak of an 
option value given Bermudan exercise. For the time interval converging to zero 
length, the value of the Bermudan option converges to the one of the American 
option. 


Equation 12-12. American option prices as optimal stopping problem

$$V_0 = \sup_{\tau \in \{0, \Delta t, 2 \Delta t, \ldots, T\}} e^{-r\tau} \mathbf{E}_0^Q(h_\tau(S_\tau))$$

The algorithm described in the following is called Least-Squares Monte Carlo (LSM)
and is from the paper by Longstaff and Schwartz (2001). It can be shown that the
value of an American (Bermudan) option at any given date $t$ is given as
$V_t(s) = \max(h_t(s), C_t(s))$, where
$C_t(s) = \mathbf{E}_t^Q(e^{-r \Delta t} V_{t + \Delta t}(S_{t + \Delta t}) \mid S_t = s)$
is the so-called continuation value of the option given an index level of $S_t = s$.


Consider now that we have simulated $I$ paths of the index level over $M$ time intervals
of equal size $\Delta t$. Define $Y_{t,i} \equiv e^{-r \Delta t} V_{t + \Delta t, i}$ to
be the simulated continuation value for path $i$ at time $t$. We cannot use this number
directly because it would imply perfect foresight. However, we can use the cross section
of all such simulated continuation values to estimate the (expected) continuation value
by least-squares regression.


Given a set of basis functions $b_d$, $d = 1, \ldots, D$, the continuation value is then
given by the regression estimate
$\bar{C}_{t,i} = \sum_{d=1}^{D} \alpha_{d,t}^{*} \cdot b_d(S_{t,i})$, where the optimal
regression parameters $\alpha^{*}$ are the solution of the least-squares problem stated
in Equation 12-13.


Equation 12-13. Least-squares regression for American option valuation

$$\min_{\alpha_{1,t}, \ldots, \alpha_{D,t}} \frac{1}{I} \sum_{i=1}^{I} \left( Y_{t,i} - \sum_{d=1}^{D} \alpha_{d,t} \cdot b_d(S_{t,i}) \right)^2$$
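A single regression step of this kind can be sketched with NumPy. With monomials as basis functions, `np.polyfit()` solves the least-squares problem of Equation 12-13 for one date. The example below uses artificial data as a stand-in for the cross section of index levels and simulated continuation values (the data, the degree, and the generator `rng` are assumptions for illustration) and cross-checks the result against `np.linalg.lstsq()`:

```python
import numpy as np

rng = np.random.default_rng(100)
n = 1000
x = rng.uniform(0.5, 1.5, n)  # stand-in for index levels S_t / S_0
# noisy stand-in for the simulated continuation values Y_t
y = np.maximum(1.1 - x, 0) + 0.05 * rng.standard_normal(n)

deg = 2
coef_polyfit = np.polyfit(x, y, deg)  # coefficients, highest power first

A = np.vander(x, deg + 1)             # basis functions x**2, x, 1 as columns
coef_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

C_hat = np.polyval(coef_polyfit, x)   # estimated continuation values
print(coef_polyfit)
print(coef_lstsq)
```

Both routes give the same coefficients, and the residuals `y - C_hat` are orthogonal to every basis function, which is the first-order condition of the minimization.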


The function gbm_mcs_amer() implements the LSM algorithm for both American call
and put options:⁴


In [78]: def gbm_mcs_amer(K, option='call'):
             ''' Valuation of American option in Black-Scholes-Merton
             by Monte Carlo simulation by LSM algorithm

             Parameters

             K: float
                 (positive) strike price of the option
             option: string
                 type of the option to be valued ('call', 'put')

             Returns

             C0: float
                 estimated present value of American option
             '''
             dt = T / M
             df = math.exp(-r * dt)
             # simulation of index levels
             S = np.zeros((M + 1, I))
             S[0] = S0
             sn = gen_sn(M, I)
             for t in range(1, M + 1):
                 S[t] = S[t - 1] * np.exp((r - 0.5 * sigma ** 2) * dt
                                          + sigma * math.sqrt(dt) * sn[t])
             # case-based calculation of payoff
             if option == 'call':
                 h = np.maximum(S - K, 0)
             else:
                 h = np.maximum(K - S, 0)
             # LSM algorithm
             V = np.copy(h)
             for t in range(M - 1, 0, -1):
                 reg = np.polyfit(S[t], V[t + 1] * df, 7)
                 C = np.polyval(reg, S[t])
                 V[t] = np.where(C > h[t], V[t + 1] * df, h[t])
             # MCS estimator
             C0 = df * np.mean(V[1])
             return C0

4 For algorithmic details, refer to Hilpisch (2015).


In [79]: gbm_mcs_amer(110., option='call') 
Out[79]: 7.721705606305352 


In [80]: gbm_mcs_amer(110., option='put') 

Out[80]: 13.609997625418051 
The European value of an option represents a lower bound to the American option’s
value. The difference is generally called the early exercise premium. What follows
compares European and American option values for the same range of strikes as
before to estimate the early exercise premium, this time with puts:⁵


In [81]: euro_res = [] 
amer_res = [] 


In [82]: k_list = np.arange(80., 120.1, 5.) 


In [83]: for K in k_list: 
euro_res.append(gbm_mcs_dyna(K, 'put')) 
amer_res.append(gbm_mcs_amer(K, 'put')) 


In [84]: euro_res = np.array(euro_res) 
amer_res = np.array(amer_res) 
Figure 12-17 shows that for the range of strikes chosen the early exercise premium
can rise up to 10%:

In [85]: fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(10, 6))
         ax1.plot(k_list, euro_res, 'b', label='European put')
         ax1.plot(k_list, amer_res, 'ro', label='American put')
         ax1.set_ylabel('put option value')
         ax1.legend(loc=0)
         wi = 1.0
         ax2.bar(k_list - wi / 2, (amer_res - euro_res) / euro_res * 100, wi)
         ax2.set_xlabel('strike')
         ax2.set_ylabel('early exercise premium in %')
         ax2.set_xlim(left=75, right=125);

5 Since no dividend payments are assumed (having an index in mind), there generally is
no early exercise premium for call options (i.e., no incentive to exercise the option
early).

Figure 12-17. European vs. American Monte Carlo estimators


Risk Measures 


In addition to valuation, risk management is another important application area of 
stochastic methods and simulation. This section illustrates the calculation/estimation 
of two of the most common risk measures applied today in the finance industry. 


Value-at-Risk 


Value-at-risk (VaR) is one of the most widely used risk measures, and a much debated
one. Loved by practitioners for its intuitive appeal, it is widely discussed and
criticized by many, mainly on theoretical grounds, with regard to its limited ability to
capture what is called tail risk (more on this shortly). In words, VaR is a number
denoted in currency units (e.g., USD, EUR, JPY) indicating a loss (of a portfolio, a
single position, etc.) that is not exceeded with some confidence level (probability)
over a given period of time.

Consider a stock position, worth 1 million USD today, that has a VaR of 50,000 USD
at a confidence level of 99% over a time period of 30 days (one month). This VaR
figure says that with a probability of 99% (i.e., in 99 out of 100 cases), the loss to be
expected over a period of 30 days will not exceed 50,000 USD. However, it does not
say anything about the size of the loss once a loss beyond 50,000 USD occurs, i.e.,
whether the maximum loss is 100,000 or 500,000 USD, or what the probability of such a
specific “higher than VaR” loss is. All it says is that there is a 1% probability that a
loss of at least 50,000 USD will occur.


Assume the Black-Scholes-Merton setup and consider the following parameterization 
and simulation of index levels at a future date T = 30/365 (a period of 30 days). The 
estimation of VaR figures requires the simulated absolute profits and losses relative 
to the value of the position today in a sorted manner, i.e., from the severest loss to the 
largest profit. Figure 12-18 shows the histogram of the simulated absolute perfor- 
mance values: 


In [86]: S0 = 100
         r = 0.05
         sigma = 0.25
         T = 30 / 365.
         I = 10000

In [87]: ST = S0 * np.exp((r - 0.5 * sigma ** 2) * T +
                          sigma * np.sqrt(T) * npr.standard_normal(I))  # (1)

In [88]: R_gbm = np.sort(ST - S0)  # (2)

In [89]: plt.figure(figsize=(10, 6))
         plt.hist(R_gbm, bins=50)
         plt.xlabel('absolute return')
         plt.ylabel('frequency');

(1) Simulates end-of-period values for the geometric Brownian motion.

(2) Calculates the absolute profits and losses per simulation run and sorts the values.

Figure 12-18. Absolute profits and losses from simulation (geometric Brownian motion)


Having the ndarray object with the sorted results, the scs.scoreatpercentile() 
function already does the trick. All one has to do is to define the percentiles of inter- 
est (in percent values). In the list object percs, 0.1 translates into a confidence level 
of 100% - 0.1% = 99.9%. The 30-day VaR given a confidence level of 99.9% in this 
case is 18.8 currency units, while it is 8.5 at the 90% confidence level: 


In [91]: percs = [0.01, 0.1, 1., 2.5, 5.0, 10.0]
         var = scs.scoreatpercentile(R_gbm, percs)
         print('%16s %16s' % ('Confidence Level', 'Value-at-Risk'))
         print(33 * '-')
         for pair in zip(percs, var):
             print('%16.2f %16.3f' % (100 - pair[0], -pair[1]))
         Confidence Level    Value-at-Risk
         ---------------------------------
                    99.99           21.814
                    99.90           18.837
                    99.00           15.230
                    97.50           12.816
                    95.00           10.824
                    90.00            8.504
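For the geometric Brownian motion, the simulated figures can be cross-checked analytically, because the index level at the horizon is log-normally distributed and its quantiles are known in closed form. The following is a minimal sketch; the helper `gbm_var()` is a hypothetical name introduced for illustration and relies on `statistics.NormalDist` from the standard library for the normal quantile:

```python
import math
from statistics import NormalDist

S0, r, sigma, T = 100.0, 0.05, 0.25, 30 / 365.0


def gbm_var(conf):
    ''' Analytical VaR for a geometric Brownian motion position
    (hypothetical helper; conf is the confidence level, e.g. 0.99). '''
    z = NormalDist().inv_cdf(1.0 - conf)  # left-tail normal quantile
    ST_q = S0 * math.exp((r - 0.5 * sigma ** 2) * T
                         + sigma * math.sqrt(T) * z)
    return S0 - ST_q


for conf in (0.9999, 0.999, 0.99, 0.975, 0.95, 0.90):
    print('%16.2f %16.3f' % (100 * conf, gbm_var(conf)))
```

At the 99% level the closed-form value is about 15.23, in line with the simulated figure. Larger deviations at the most extreme confidence levels would not be surprising, since with 10,000 draws only a handful of observations lie that far in the tail.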


As a second example, recall the jump diffusion setup from Merton, which is
simulated dynamically. In this case, with the jump component having a negative mean, one
sees something like a bimodal distribution for the simulated profits/losses in
Figure 12-19. From a normal distribution point of view, one sees a pronounced left
fat tail:

In [92]: dt = 30. / 365 / M
         rj = lamb * (math.exp(mu + 0.5 * delta ** 2) - 1)

In [93]: S = np.zeros((M + 1, I))
         S[0] = S0
         sn1 = npr.standard_normal((M + 1, I))
         sn2 = npr.standard_normal((M + 1, I))
         poi = npr.poisson(lamb * dt, (M + 1, I))
         for t in range(1, M + 1, 1):
             S[t] = S[t - 1] * (np.exp((r - rj - 0.5 * sigma ** 2) * dt
                                + sigma * math.sqrt(dt) * sn1[t])
                                + (np.exp(mu + delta * sn2[t]) - 1)
                                * poi[t])
             S[t] = np.maximum(S[t], 0)

In [94]: R_jd = np.sort(S[-1] - S0)

In [95]: plt.figure(figsize=(10, 6))
         plt.hist(R_jd, bins=50)
         plt.xlabel('absolute return')
         plt.ylabel('frequency');

Figure 12-19. Absolute profits and losses from simulation (jump diffusion)

For this process and parameterization, the VaR over 30 days at the 90% level is
almost identical to that for the geometric Brownian motion, while it is more than three
times as high at the 99.9% level (70 vs. 18.8 currency units):


In [96]: percs = [0.01, 0.1, 1., 2.5, 5.0, 10.0]
         var = scs.scoreatpercentile(R_jd, percs)
         print('%16s %16s' % ('Confidence Level', 'Value-at-Risk'))
         print(33 * '-')
         for pair in zip(percs, var):
             print('%16.2f %16.3f' % (100 - pair[0], -pair[1]))
         Confidence Level    Value-at-Risk
         ---------------------------------
                    99.99           76.520
                    99.90           69.396
                    99.00           55.974
                    97.50           46.405
                    95.00           24.198
                    90.00            8.836


This illustrates the problem of capturing the tail risk so often encountered in finan- 
cial markets by the standard VaR measure. 


To further illustrate the point, Figure 12-20 lastly shows the VaR measures for both 
cases in direct comparison graphically. As the plot reveals, the VaR measures behave 
completely differently given a range of typical confidence levels: 


In [97]: percs = list(np.arange(0.0, 10.1, 0.1)) 
gbm_var = scs.scoreatpercentile(R_gbm, percs) 
jd_var = scs.scoreatpercentile(R_jd, percs) 


In [98]: plt.figure(figsize=(10, 6))
         plt.plot(percs, gbm_var, 'b', lw=1.5, label='GBM')
         plt.plot(percs, jd_var, 'r', lw=1.5, label='JD')
         plt.legend(loc=4)
         plt.xlabel('100 - confidence level [%]')
         plt.ylabel('value-at-risk')
         plt.ylim(top=0.0);

Figure 12-20. Value-at-risk for geometric Brownian motion and jump diffusion


Credit Valuation Adjustments 


Other important risk measures are the credit value-at-risk (CVaR) and the credit val- 
uation adjustment (CVA), which is derived from the CVaR. Roughly speaking, CVaR 
is a measure for the risk resulting from the possibility that a counterparty might not 
be able to honor its obligations—for example, if the counterparty goes bankrupt. In 
such a case there are two main assumptions to be made: the probability of default and 
the (average) loss level. 


To make it specific, consider again the benchmark setup of Black-Scholes-Merton 
with the parameterization in the following code. In the simplest case, one considers a 
fixed (average) loss level L and a fixed probability p of default (per year) of a counter- 
party. Using the Poisson distribution, default scenarios are generated as follows, tak- 
ing into account that a default can only occur once: 


In [99]: S0 = 100.
         r = 0.05
         sigma = 0.2
         T = 1.
         I = 100000

In [100]: ST = S0 * np.exp((r - 0.5 * sigma ** 2) * T
                  + sigma * np.sqrt(T) * npr.standard_normal(I))

In [101]: L = 0.5  # (1)

In [102]: p = 0.01  # (2)

In [103]: D = npr.poisson(p * T, I)  # (3)

In [104]: D = np.where(D > 1, 1, D)  # (4)

(1) Defines the loss level.

(2) Defines the probability of default.

(3) Simulates default events.

(4) Limits defaults to one such event.
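
A short check of what the capped Poisson draw implies may help here. The sketch below, which uses only the standard library and the parameter values from the text, computes the probability of at least one Poisson event and the expected number of draws that the cap actually changes.

```python
import math

p, T, I = 0.01, 1.0, 100_000

# probability of at least one Poisson event with intensity p * T
prob_default = 1.0 - math.exp(-p * T)
# probability of two or more events, i.e., draws that get capped at one
prob_capped = 1.0 - math.exp(-p * T) * (1.0 + p * T)

print(round(prob_default, 5))       # default probability per path
print(round(prob_default * I))      # expected number of defaults
print(round(prob_capped * I, 1))    # expected number of capped draws
```

The implied default probability is slightly below p itself, and only a handful of the 100,000 draws are expected to exceed one, so the cap has little effect for small p.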


Without default, the discounted risk-neutral expected value of the future index level
should equal the value of the asset today (up to differences resulting from numerical
errors). The CVaR and the present value of the asset, adjusted for the credit risk, are
given as follows:


In [105]: math.exp(-r * T) * np.mean(ST)  # (1)
Out[105]: 99.94767178982691

In [106]: CVaR = math.exp(-r * T) * np.mean(L * D * ST)  # (2)
          CVaR  # (2)
Out[106]: 0.4883560258963962

In [107]: S0_CVA = math.exp(-r * T) * np.mean((1 - L * D) * ST)  # (3)
          S0_CVA  # (3)
Out[107]: 99.45931576393053

In [108]: S0_adj = S0 - CVaR  # (4)
          S0_adj  # (4)
Out[108]: 99.5116439741036

(1) Discounted average simulated value of the asset at T.

(2) CVaR as the discounted average of the future losses in the case of a default.

(3) Discounted average simulated value of the asset at T, adjusted for the simulated
losses from default.

(4) Current price of the asset adjusted by the simulated CVaR.
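
The two adjusted values differ only because the Monte Carlo estimate of the discounted asset value is not exactly S0. By linearity of the mean, the credit-adjusted estimator equals the unadjusted estimator minus the CVaR. The following sketch illustrates this under the same parameterization; the random generator, the seed, and the variable names V and V_adj are assumptions made for the illustration.

```python
import math
import numpy as np

rng = np.random.default_rng(100)
S0, r, sigma, T, I = 100.0, 0.05, 0.2, 1.0, 100_000
L, p = 0.5, 0.01

ST = S0 * np.exp((r - 0.5 * sigma ** 2) * T
                 + sigma * math.sqrt(T) * rng.standard_normal(I))
D = np.minimum(rng.poisson(p * T, I), 1)   # at most one default per path

disc = math.exp(-r * T)
V = disc * np.mean(ST)                     # value without credit risk
CVaR = disc * np.mean(L * D * ST)          # discounted expected loss
V_adj = disc * np.mean((1 - L * D) * ST)   # credit-adjusted value

print(V - CVaR, V_adj)                     # identical up to rounding
```

With the numbers reported above, 99.9477 minus 0.4884 gives 99.4593, which is the value shown in Out[107].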


In this particular simulation example, one observes roughly 1,000 losses due to credit
risk, which is to be expected given the assumed default probability of 1% and 100,000
simulated paths. Figure 12-21 shows the complete frequency distribution of the losses
due to a default. Of course, in the large majority of cases (i.e., in about 99,000 of the
100,000 cases) there is no loss to observe:

In [109]: np.count_nonzero(L * D * ST)  # (1)
Out[109]: 978


In [110]: plt.figure(figsize=(10, 6)) 
plt.hist(L * D * ST, bins=50) 
plt.xlabel('loss') 
plt.ylabel('frequency') 
plt.ylim(ymax=175); 


(1) Number of default events and thereby loss events.

Figure 12-21. Losses due to risk-neutrally expected default (stock)


Consider now the case of a European call option. Its value is about 10.4 currency 
units at a strike of 100. The CVaR is about 5 cents given the same assumptions with 
regard to probability of default and loss level: 


In [111]: K = 100.
          hT = np.maximum(ST - K, 0)

In [112]: C0 = math.exp(-r * T) * np.mean(hT)  # (1)
          C0  # (1)
Out[112]: 10.396916492839354

In [113]: CVaR = math.exp(-r * T) * np.mean(L * D * hT)  # (2)
          CVaR  # (2)
Out[113]: 0.05159099858923533

In [114]: C0_CVA = math.exp(-r * T) * np.mean((1 - L * D) * hT)  # (3)
          C0_CVA  # (3)
Out[114]: 10.34532549425012

(1) The Monte Carlo estimator value for the European call option.

(2) The CVaR as the discounted average of the future losses in the case of a default.

(3) The Monte Carlo estimator value for the European call option, adjusted for the
simulated losses from default.

Compared to the case of a regular asset, the option case has somewhat different char- 
acteristics. One only sees a little more than 500 losses due to a default, although there 
are again 1,000 defaults in total. This results from the fact that the payoff of the 
option at maturity has a high probability of being zero. Figure 12-22 shows that the 
CVaR for the option has quite a different frequency distribution compared to the reg- 
ular asset case: 


In [115]: np.count_nonzero(L * D * hT)  # (1)
Out[115]: 538

In [116]: np.count_nonzero(D)  # (2)
Out[116]: 978

In [117]: I - np.count_nonzero(hT)  # (3)
Out[117]: 44123

In [118]: plt.figure(figsize=(10, 6))
          plt.hist(L * D * hT, bins=50)
          plt.xlabel('loss')
          plt.ylabel('frequency')
          plt.ylim(ymax=350);

(1) The number of losses due to default.

(2) The number of defaults.

(3) The number of cases for which the option expires worthless.

Figure 12-22. Losses due to risk-neutrally expected default (call option)


Python Script 


The following presents an implementation of central functions related to the Black- 
Scholes-Merton model for the analytical pricing of European (call) options. For 
details of the model, see Black and Scholes (1973) as well as Merton (1973). See 
Appendix B for an alternative implementation based on a Python class. 


#
# Valuation of European call options
# in Black-Scholes-Merton model
# incl. vega function and implied volatility estimation
# bsm_functions.py
#
# (c) Dr. Yves J. Hilpisch
# Python for Finance, 2nd ed.
#


def bsm_call_value(S0, K, T, r, sigma):
    ''' Valuation of European call option in BSM model.
    Analytical formula.

    Parameters
    ==========
    S0: float
        initial stock/index level
    K: float
        strike price
    T: float
        maturity date (in year fractions)
    r: float
        constant risk-free short rate
    sigma: float
        volatility factor in diffusion term

    Returns
    =======
    value: float
        present value of the European call option
    '''
    from math import log, sqrt, exp
    from scipy import stats

    S0 = float(S0)
    d1 = (log(S0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = (log(S0 / K) + (r - 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    # stats.norm.cdf --> cumulative distribution function
    #                    for normal distribution
    value = (S0 * stats.norm.cdf(d1, 0.0, 1.0) -
             K * exp(-r * T) * stats.norm.cdf(d2, 0.0, 1.0))
    return value


def bsm_vega(S0, K, T, r, sigma):
    ''' Vega of European option in BSM model.

    Parameters
    ==========
    S0: float
        initial stock/index level
    K: float
        strike price
    T: float
        maturity date (in year fractions)
    r: float
        constant risk-free short rate
    sigma: float
        volatility factor in diffusion term

    Returns
    =======
    vega: float
        partial derivative of BSM formula with respect
        to sigma, i.e. vega
    '''
    from math import log, sqrt
    from scipy import stats

    S0 = float(S0)
    d1 = (log(S0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    vega = S0 * stats.norm.pdf(d1, 0.0, 1.0) * sqrt(T)
    return vega


# Implied volatility function


def bsm_call_imp_vol(S0, K, T, r, C0, sigma_est, it=100):
    ''' Implied volatility of European call option in BSM model.

    Parameters
    ==========
    S0: float
        initial stock/index level
    K: float
        strike price
    T: float
        maturity date (in year fractions)
    r: float
        constant risk-free short rate
    C0: float
        observed price of the call option
    sigma_est: float
        estimate of impl. volatility
    it: integer
        number of iterations

    Returns
    =======
    sigma_est: float
        numerically estimated implied volatility
    '''
    # Newton iteration: sigma <- sigma - (value(sigma) - C0) / vega(sigma)
    for i in range(it):
        sigma_est -= ((bsm_call_value(S0, K, T, r, sigma_est) - C0) /
                      bsm_vega(S0, K, T, r, sigma_est))
    return sigma_est
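
A round trip is a quick way to see how the three functions fit together: price a call at a known volatility, then recover that volatility from the price. The sketch below is a self-contained variant that swaps scipy.stats for math.erf so that it runs with the standard library only; the function names call_value(), vega(), and implied_vol() are illustrative and not part of bsm_functions.py.

```python
from math import erf, exp, log, pi, sqrt


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def norm_pdf(x):
    """Standard normal probability density function."""
    return exp(-0.5 * x ** 2) / sqrt(2.0 * pi)


def call_value(S0, K, T, r, sigma):
    d1 = (log(S0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S0 * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)


def vega(S0, K, T, r, sigma):
    d1 = (log(S0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    return S0 * norm_pdf(d1) * sqrt(T)


def implied_vol(S0, K, T, r, C0, sigma_est, it=100):
    # Newton iteration on the pricing error, as in bsm_call_imp_vol()
    for _ in range(it):
        sigma_est -= ((call_value(S0, K, T, r, sigma_est) - C0) /
                      vega(S0, K, T, r, sigma_est))
    return sigma_est


C0 = call_value(100.0, 100.0, 1.0, 0.05, 0.2)
iv = implied_vol(100.0, 100.0, 1.0, 0.05, C0, sigma_est=0.3)
print(round(C0, 4), round(iv, 6))
```

The analytical value of about 10.45 is the benchmark for the Monte Carlo estimate of roughly 10.4 reported in the credit valuation example. Note that the Newton scheme has no safeguard against a vega close to zero, which can occur for options far in or out of the money.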


Conclusion 


This chapter deals with methods and techniques important to the application of 
Monte Carlo simulation in finance. In particular, it first shows how to generate 
pseudo-random numbers based on different distribution laws. It proceeds with the 
simulation of random variables and stochastic processes, which is important in many 
financial areas. Two application areas are discussed in some depth in this chapter: 
valuation of options with European and American exercise and the estimation of risk 
measures like value-at-risk and credit valuation adjustments. 


The chapter illustrates that Python in combination with NumPy is well suited to imple- 
menting even such computationally demanding tasks as the valuation of American 
options by Monte Carlo simulation. This is mainly due to the fact that the majority of 
functions and classes of NumPy are implemented in C, which leads to considerable
speed advantages in general over pure Python code. A further benefit is the compact- 
ness and readability of the resulting code due to vectorized operations. 


Further Resources 


The original article introducing Monte Carlo simulation to finance is: 


• Boyle, Phelim (1977). “Options: A Monte Carlo Approach.” Journal of Financial
  Economics, Vol. 4, No. 4, pp. 322-338.


Other original papers cited in this chapter are (see also Chapter 18):


• Black, Fischer, and Myron Scholes (1973). “The Pricing of Options and Corporate
  Liabilities.” Journal of Political Economy, Vol. 81, No. 3, pp. 638-659.

• Cox, John, Jonathan Ingersoll, and Stephen Ross (1985). “A Theory of the Term
  Structure of Interest Rates.” Econometrica, Vol. 53, No. 2, pp. 385-407.

• Heston, Steven (1993). “A Closed-Form Solution for Options with Stochastic
  Volatility with Applications to Bond and Currency Options.” The Review of
  Financial Studies, Vol. 6, No. 2, pp. 327-343.

• Merton, Robert (1973). “Theory of Rational Option Pricing.” Bell Journal of
  Economics and Management Science, Vol. 4, pp. 141-183.

• Merton, Robert (1976). “Option Pricing When the Underlying Stock Returns Are
  Discontinuous.” Journal of Financial Economics, Vol. 3, No. 3, pp. 125-144.


The following books cover the topics of this chapter in more depth (however, the first 
one does not cover technical implementation details): 


• Glasserman, Paul (2004). Monte Carlo Methods in Financial Engineering. New
  York: Springer.

• Hilpisch, Yves (2015). Derivatives Analytics with Python. Chichester, England:
  Wiley Finance.


It took until the turn of the century for an efficient method to value American
options by Monte Carlo simulation to finally be published:


• Longstaff, Francis, and Eduardo Schwartz (2001). “Valuing American Options by
  Simulation: A Simple Least Squares Approach.” Review of Financial Studies, Vol.
  14, No. 1, pp. 113-147.


A broad and in-depth treatment of credit risk is provided in:


• Duffie, Darrell, and Kenneth Singleton (2003). Credit Risk—Pricing, Measurement,
  and Management. Princeton, NJ: Princeton University Press.


CHAPTER 13
Statistics 


I can prove anything by statistics except the truth. 


—George Canning 


Statistics is a vast field, but the tools and results it provides have become indispensa- 
ble for finance. This explains the popularity of domain-specific languages like R in 
the finance industry. The more elaborate and complex statistical models become, the 
more important it is to have available easy-to-use and high-performing computa- 
tional solutions. 


A single chapter in a book like this one cannot do justice to the richness and depth of 
the field of statistics. Therefore, the approach—as in many other chapters—is to focus 
on selected topics that seem of importance or that provide a good starting point when 
it comes to the use of Python for the particular tasks at hand. The chapter has four 
focal points: 


“Normality Tests” on page 398 
A large number of important financial models, like modern or mean-variance 
portfolio theory (MPT) and the capital asset pricing model (CAPM), rest on the 
assumption that returns of securities are normally distributed. Therefore, this 
chapter presents approaches to test a given time series for normality of returns. 


“Portfolio Optimization” on page 415 
MPT can be considered one of the biggest successes of statistics in finance. Start- 
ing in the early 1950s with the work of pioneer Harry Markowitz, this theory 
began to replace people’s reliance on judgment and experience with rigorous 
mathematical and statistical methods when it comes to the investment of money 
in financial markets. In that sense, it is maybe the first real quantitative model 
and approach in finance. 




“Bayesian Statistics” on page 429 


On a conceptual level, Bayesian statistics introduces the notion of beliefs of 
agents and the updating of beliefs to statistics. When it comes to linear regression, 
this might take the form of having a statistical distribution for regression param- 
eters instead of single point estimates (e.g., for the intercept and slope of the 
regression line). Nowadays, Bayesian methods are widely used in finance, which 
is why this section illustrates Bayesian methods based on some examples. 


“Machine Learning” on page 444 


Machine learning (or statistical learning) is based on advanced statistical meth- 
ods and is considered a subdiscipline of artificial intelligence (AI). Like statistics 
itself, machine learning offers a rich set of approaches and models to learn from 
data sets and create predictions based on what is learned. Different algorithms of 
learning are distinguished, such as those for supervised learning or unsupervised 
learning. The types of problems solved by the algorithms differ as well, such as 
estimation or classification. The examples presented in this chapter fall in the cat- 
egory of supervised learning for classification. 


Many aspects in this chapter relate to date and/or time information. Refer to Appen- 
dix A for an overview of handling such data with Python, NumPy, and pandas. 


Normality Tests 


The normal distribution can be considered the most important distribution in finance 
and one of the major statistical building blocks of financial theory. Among others, the 
following cornerstones of financial theory rest to a large extent on the assumption 
that returns of a financial instrument are normally distributed:¹


Portfolio theory 


When stock returns are normally distributed, optimal portfolio choice can be 
cast into a setting where only the (expected) mean return and the variance of the 
returns (or the volatility) as well as the covariances between different stocks are 
relevant for an investment decision (i.e., an optimal portfolio composition). 


Capital asset pricing model 


Again, when stock returns are normally distributed, prices of single stocks can be 
elegantly expressed in linear relationship to a broad market index; the relation- 
ship is generally expressed by a measure for the co-movement of a single stock 
with the market index called beta or β.


1 Another central assumption is the one of linearity. For example, financial markets are assumed, in general, to
exhibit a linear relationship between demand, say for shares of a stock, and the price to be paid for the shares.
In other words, markets are assumed, in general, to be perfectly liquid in the sense that varying demand does
not have any influence on the unit price for a financial instrument.


Efficient markets hypothesis
An efficient market is a market where prices reflect all available information, 
where “all” can be defined more narrowly or more widely (e.g., as in “all publicly 
available” information vs. including also “only privately available” information). 
If this hypothesis holds true, then stock prices fluctuate randomly and returns are 
normally distributed. 


Option pricing theory 
Brownian motion is the benchmark model for the modeling of random price 
movements of financial instruments; the famous Black-Scholes-Merton option 
pricing formula uses a geometric Brownian motion as the model for a stock’s 
random price fluctuations over time, leading to log-normally distributed prices 
and normally distributed returns. 


This list, which is far from exhaustive, underlines the importance of the normality
assumption in finance.


Benchmark Case 


To set the stage for further analyses, the analysis starts with the geometric Brownian 
motion as one of the canonical stochastic processes used in financial modeling. The 
following can be said about the characteristics of paths from a geometric Brownian 
motion S: 


Normal log returns 


Log returns $\log \frac{S_t}{S_s} = \log S_t - \log S_s$ between two times $0 < s < t$ are normally
distributed.


Log-normal values
At any time $t > 0$, the values $S_t$ are log-normally distributed.
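
The first property can be checked numerically. Under the risk-neutral geometric Brownian motion used throughout, the log return between s and t has mean (r - sigma ** 2 / 2) * (t - s) and standard deviation sigma * sqrt(t - s). The following sketch compares sample moments with these values; the times s and t, the seed, and the number of paths are assumptions chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
S0, r, sigma = 100.0, 0.05, 0.2
s, t, I = 0.5, 1.0, 200_000

# exact GBM transition: first to time s, then from s to t
S_s = S0 * np.exp((r - 0.5 * sigma ** 2) * s
                  + sigma * np.sqrt(s) * rng.standard_normal(I))
S_t = S_s * np.exp((r - 0.5 * sigma ** 2) * (t - s)
                   + sigma * np.sqrt(t - s) * rng.standard_normal(I))

log_ret = np.log(S_t / S_s)
# sample moments next to their theoretical counterparts
print(log_ret.mean(), (r - 0.5 * sigma ** 2) * (t - s))
print(log_ret.std(), sigma * np.sqrt(t - s))
```

The sample values agree with the theoretical ones only up to sampling error, since no variance reduction is applied here.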


For what follows, the plotting setup is taken care of first. Then a number of Python
packages, including scipy.stats and statsmodels.api, are imported:


In [1]: import math 
import numpy as np 
import scipy.stats as scs 
import statsmodels.api as sm 
from pylab import mpl, plt 


In [2]: plt.style.use('seaborn') 
mpl.rcParams['font.family'] = 'serif' 
%matplotlib inline 


The following uses the function gen_paths() to generate sample Monte Carlo paths 
for the geometric Brownian motion (see also Chapter 12): 
In [3]: def gen_paths(S0, r, sigma, T, M, I):
            ''' Generate Monte Carlo paths for geometric Brownian motion.

            Parameters
            ==========
            S0: float
                initial stock/index value
            r: float
                constant short rate
            sigma: float
                constant volatility
            T: float
                final time horizon
            M: int
                number of time steps/intervals
            I: int
                number of paths to be simulated

            Returns
            =======
            paths: ndarray, shape (M + 1, I)
                simulated paths given the parameters
            '''
            dt = T / M
            paths = np.zeros((M + 1, I))
            paths[0] = S0
            for t in range(1, M + 1):
                rand = np.random.standard_normal(I)
                rand = (rand - rand.mean()) / rand.std()  # (1)
                paths[t] = paths[t - 1] * np.exp((r - 0.5 * sigma ** 2) * dt +
                                                 sigma * math.sqrt(dt) * rand)  # (2)
            return paths

(1) Matching first and second moment.

(2) Vectorized Euler discretization of geometric Brownian motion.


The simulation is based on the parameterization for the Monte Carlo simulation as 
shown here, generating, in combination with the function gen_paths(), 250,000 
paths with 50 time steps each. Figure 13-1 shows the first 10 simulated paths: 


In [4]: S0 = 100.  # (1)
        r = 0.05  # (2)
        sigma = 0.2  # (3)
        T = 1.0  # (4)
        M = 50  # (5)
        I = 250000  # (6)
        np.random.seed(1000)

In [5]: paths = gen_paths(S0, r, sigma, T, M, I)

In [6]: S0 * math.exp(r * T)  # (7)
Out[6]: 105.12710963760242

In [7]: paths[-1].mean()  # (7)
Out[7]: 105.12645392478755

In [8]: plt.figure(figsize=(10, 6))
        plt.plot(paths[:, :10])
        plt.xlabel('time steps')
        plt.ylabel('index level');

(1) Initial value for simulated processes.

(2) Constant short rate.

(3) Constant volatility factor.

(4) Time horizon in year fractions.

(5) Number of time intervals.

(6) Number of simulated processes.

(7) Expected value and average simulated value.

Figure 13-1. Ten simulated paths of geometric Brownian motion

The main interest is in the distribution of the log returns. To this end, an ndarray
object with all the log returns is created based on the simulated paths. Here, a single 
simulated path and the resulting log returns are shown: 


In [9]: paths[:, 0].round(4) 

Out[9]: array([100.    ,  97.824 ,  98.5573, 106.1546, 105.899 ,  99.8363,
               100.0145, 102.6589, 105.6643, 107.1107, 108.7943, 108.2449,
               106.4105, 101.0575, 102.0197, 102.6052, 109.6419, 109.5725,
               112.9766, 113.0225, 112.5476, 114.5585, 109.942 , 112.6271,
               112.7502, 116.3453, 115.0443, 113.9586, 115.8831, 117.3705,
               117.9185, 110.5539, 109.9687, 104.9957, 108.0679, 105.7822,
               105.1585, 104.3304, 108.4387, 105.5963, 108.866 , 108.3284,
               107.0077, 106.0034, 104.3964, 101.0637,  98.3776,  97.135 ,
                95.4254,  96.4271,  96.3386])


In [10]: log_returns = np.log(paths[1:] / paths[:-1]) 


In [11]: log_returns[:, 0].round(4) 


Out[11]: array([-0.022 ,  0.0075,  0.0743, -0.0024, -0.059 ,  0.0018,  0.0261,
                 0.0289,  0.0136,  0.0156, -0.0051, -0.0171, -0.0516,  0.0095,
                 0.0057,  0.0663, -0.0006,  0.0306,  0.0004, -0.0042,  0.0177,
                -0.0411,  0.0241,  0.0011,  0.0314, -0.0112, -0.0095,  0.0167,
                 0.0128,  0.0047, -0.0645, -0.0053, -0.0463,  0.0288, -0.0214,
                -0.0059, -0.0079,  0.0386, -0.0266,  0.0305, -0.0049, -0.0123,
                -0.0094, -0.0153, -0.0324, -0.0269, -0.0127, -0.0178,  0.0104,
                -0.0009])


This is something one might experience in financial markets as well: days when one 
makes a positive return on an investment and other days when one is losing money 
relative to the most recent wealth position. 


The function print_statistics() is a wrapper function for the scs.describe() 
function from the scipy.stats subpackage. It mainly generates a better 
(human-)readable output for such statistics as the mean, the skewness, or the kurtosis 
of a given (historical or simulated) data set: 


In [13]: def print_statistics(array):
             ''' Prints selected statistics.

             Parameters
             ==========
             array: ndarray
                 object to generate statistics on
             '''
             sta = scs.describe(array)
             print('%14s %15s' % ('statistic', 'value'))
             print(30 * '-')
             print('%14s %15.5f' % ('size', sta[0]))
             print('%14s %15.5f' % ('min', sta[1][0]))
             print('%14s %15.5f' % ('max', sta[1][1]))
             print('%14s %15.5f' % ('mean', sta[2]))
             print('%14s %15.5f' % ('std', np.sqrt(sta[3])))
             print('%14s %15.5f' % ('skew', sta[4]))
             print('%14s %15.5f' % ('kurtosis', sta[5]))


In [14]: print_statistics(log_returns.flatten())
     statistic           value
------------------------------
          size  12500000.00000
           min        -0.15664
           max         0.15371
          mean         0.00060
           std         0.02828
          skew         0.00055
      kurtosis         0.00085

In [15]: log_returns.mean() * M + 0.5 * sigma ** 2  # (1)
Out[15]: 0.05000000000000005

In [16]: log_returns.std() * math.sqrt(M)  # (2)
Out[16]: 0.20000000000000015

(1) Annualized mean log return after correction for the Itô term.²

(2) Annualized volatility; i.e., annualized standard deviation of log returns.


The data set in this case consists of 12,500,000 data points with the values mainly
lying between +/- 0.15. One would expect annualized values of 0.05 for the mean
return (after correcting for the Itô term) and 0.2 for the standard deviation
(volatility). The annualized values almost match these values perfectly (multiply the
mean value by 50 and correct it for the Itô term; multiply the standard deviation by √50).
One reason for the good match is the use of moment matching for variance reduction
when drawing the random numbers (see “Variance Reduction” on page 372).
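
The effect of moment matching on these annualized figures can be reproduced in isolation. The sketch below generates per-step log returns directly, with each batch of random numbers standardized to mean zero and standard deviation one; the smaller number of paths and the random generator are assumptions made to keep the example light.

```python
import math
import numpy as np

rng = np.random.default_rng(1000)
r, sigma, T, M, I = 0.05, 0.2, 1.0, 50, 10_000
dt = T / M

log_returns = np.empty((M, I))
for t in range(M):
    rand = rng.standard_normal(I)
    rand = (rand - rand.mean()) / rand.std()   # moment matching
    log_returns[t] = ((r - 0.5 * sigma ** 2) * dt
                      + sigma * math.sqrt(dt) * rand)

# annualization as in the text; multiplying by M is valid because T = 1
ann_mean = log_returns.mean() * M + 0.5 * sigma ** 2
ann_vol = log_returns.std() * math.sqrt(M)
print(ann_mean, ann_vol)
```

Because every time step has exactly the intended sample mean and standard deviation, the annualized values reproduce r and sigma up to floating-point error, regardless of the number of paths.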


Figure 13-2 compares the frequency distribution of the simulated log returns with the
probability density function (PDF) of the normal distribution given the
parameterizations for r and sigma. The function used is norm.pdf() from the scipy.stats
subpackage. There is obviously quite a good fit:


In [17]: plt.figure(figsize=(10, 6))
         plt.hist(log_returns.flatten(), bins=70, density=True,
                  label='frequency', color='b')
         plt.xlabel('log return')
         plt.ylabel('frequency')
         x = np.linspace(plt.axis()[0], plt.axis()[1])
         # mean log return per time step is (r - sigma ** 2 / 2) / M
         plt.plot(x, scs.norm.pdf(x, loc=(r - 0.5 * sigma ** 2) / M,
                                  scale=sigma / np.sqrt(M)),
                  'r', lw=2.0, label='pdf')  # (1)
         plt.legend();

(1) Plots the PDF for the assumed parameters scaled to the interval length.

Figure 13-2. Histogram of log returns of geometric Brownian motion and normal density
function

2 For the fundamentals of stochastic and Itô calculus needed in this context, refer to Glasserman (2004).


Comparing a frequency distribution (histogram) with a theoretical PDF is not the
only way to graphically “test” for normality. So-called quantile-quantile (QQ) plots
are also well suited for this task. Here, sample quantile values are compared to
theoretical quantile values. For normally distributed sample data sets, such a plot might
look like Figure 13-3, with the vast majority of the quantile values (dots) lying on
a straight line:

In [18]: sm.qqplot(log_returns.flatten()[::500], line='s')
         plt.xlabel('theoretical quantiles')
         plt.ylabel('sample quantiles');

Figure 13-3. Quantile-quantile plot for log returns of geometric Brownian motion
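
What a QQ plot compares can be spelled out by hand. The following sketch sorts a standardized sample, computes the matching quantiles of the standard normal distribution, and measures how close the pairs are to a straight line. The plotting positions (i - 0.5) / n are one common convention and an assumption here; they are not necessarily the ones sm.qqplot() uses, and the sample parameters are illustrative.

```python
from statistics import NormalDist

import numpy as np

rng = np.random.default_rng(7)
sample = rng.normal(loc=0.0006, scale=0.028, size=5_000)

n = len(sample)
probs = (np.arange(1, n + 1) - 0.5) / n            # plotting positions
theo = np.array([NormalDist().inv_cdf(q) for q in probs])
emp = np.sort((sample - sample.mean()) / sample.std())

# correlation close to one means the dots lie close to a straight line
corr = np.corrcoef(theo, emp)[0, 1]
print(corr)
```

For fat-tailed data, the smallest and largest sample quantiles deviate from the line, which lowers this correlation and shows up as curved ends in the plot.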


However appealing the graphical approaches might be, they generally cannot replace 
more rigorous testing procedures. The function normality_tests() used in the next 
example combines three different statistical tests: 


Skewness test (skewtest()) 
This tests whether the skew of the sample data is “normal” (i.e., has a value close 
enough to zero). 


Kurtosis test (kurtosistest()) 
Similarly, this tests whether the kurtosis of the sample data is “normal” (again, 
close enough to zero). 


Normality test (normaltest()) 
This combines the other two test approaches to test for normality. 
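
The two statistics behind these tests are simple functions of central moments. The sketch below computes the biased sample skewness and the excess kurtosis with NumPy only; to the best of my knowledge these correspond to the default settings of scs.skew() and scs.kurtosis(), but the helper names and the test data are assumptions for illustration.

```python
import numpy as np


def sample_skew(x):
    """Biased sample skewness: m3 / m2 ** 1.5."""
    d = np.asarray(x, dtype=float) - np.mean(x)
    return np.mean(d ** 3) / np.mean(d ** 2) ** 1.5


def sample_excess_kurtosis(x):
    """Biased sample excess kurtosis: m4 / m2 ** 2 - 3."""
    d = np.asarray(x, dtype=float) - np.mean(x)
    return np.mean(d ** 4) / np.mean(d ** 2) ** 2 - 3.0


rng = np.random.default_rng(0)
normal = rng.standard_normal(200_000)
skewed = rng.exponential(size=200_000)   # skew 2, excess kurtosis 6 in theory

print(sample_skew(normal), sample_excess_kurtosis(normal))
print(sample_skew(skewed), sample_excess_kurtosis(skewed))
```

Both values are close to zero for the normal sample and clearly positive for the exponential one; the tests turn the size of such deviations into p-values.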


The test values indicate that the normality hypothesis cannot be rejected for the log
returns of the geometric Brownian motion; i.e., all tests show p-values above 0.05:


In [19]: def normality_tests(arr):
             ''' Tests for normality distribution of given data set.

             Parameters
             ==========
             arr: ndarray
                 object to generate statistics on
             '''
             print('Skew of data set  %14.3f' % scs.skew(arr))
             print('Skew test p-value %14.3f' % scs.skewtest(arr)[1])
             print('Kurt of data set  %14.3f' % scs.kurtosis(arr))
             print('Kurt test p-value %14.3f' % scs.kurtosistest(arr)[1])
             print('Norm test p-value %14.3f' % scs.normaltest(arr)[1])

In [20]: normality_tests(log_returns.flatten())  # (1)
Skew of data set           0.001
Skew test p-value          0.430
Kurt of data set           0.001
Kurt test p-value          0.541
Norm test p-value          0.607

(1) All p-values are well above 0.05.

Finally, a check whether the end-of-period values are indeed log-normally
distributed. This boils down to a normality test, since one only has to transform the data
by applying the log function to it to then arrive at normally distributed values (or
maybe not). Figure 13-4 plots both the log-normally distributed end-of-period values
and the transformed ones (“log index level”):


In [21]: f, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 6))
         ax1.hist(paths[-1], bins=30)
         ax1.set_xlabel('index level')
         ax1.set_ylabel('frequency')
         ax1.set_title('regular data')
         ax2.hist(np.log(paths[-1]), bins=30)
         ax2.set_xlabel('log index level')
         ax2.set_title('log data')

Figure 13-4. Histogram of simulated end-of-period index levels for geometric Brownian
motion

The statistics for the data set show expected behavior, for example, a mean value
close to 105. The log index level values have skew and kurtosis values close to zero
and they show high p-values, providing strong support for the normal distribution
hypothesis:


In [22]: print_statistics(paths[-1])
     statistic           value
------------------------------
          size    250000.00000
           min        42.74870
           max       233.58435
          mean       105.12645
           std        21.23174
          skew         0.61116
      kurtosis         0.65182

In [23]: print_statistics(np.log(paths[-1]))
     statistic           value
------------------------------
          size    250000.00000
           min         3.75534
           max         5.45354
          mean         4.63517
           std         0.19998
          skew        -0.00092
      kurtosis        -0.00327

In [24]: normality_tests(np.log(paths[-1]))
Skew of data set          -0.001
Skew test p-value          0.851
Kurt of data set          -0.003
Kurt test p-value          0.744
Norm test p-value          0.931


Figure 13-5 again compares the frequency distribution with the PDF of the normal
distribution, showing a good fit (as is now, of course, to be expected):


In [25]: plt.figure(figsize=(10, 6))
         log_data = np.log(paths[-1])
         plt.hist(log_data, bins=70, density=True,
                  label='observed', color='b')
         plt.xlabel('index levels')
         plt.ylabel('frequency')
         x = np.linspace(plt.axis()[0], plt.axis()[1])
         plt.plot(x, scs.norm.pdf(x, log_data.mean(), log_data.std()),
                  'r', lw=2.0, label='pdf')
         plt.legend();

Figure 13-5. Histogram of log index levels of geometric Brownian motion and normal
density function


Figure 13-6 also supports the hypothesis that the log index levels are normally 
distributed: 


In [26]: sm.qqplot(log_data, line='s')
         plt.xlabel('theoretical quantiles')
         plt.ylabel('sample quantiles');

Figure 13-6. Quantile-quantile plot for log index levels of geometric Brownian motion


Normality 


The normality assumption with regard to the uncertain returns of 
financial instruments is central to a number of financial theories. 
Python provides efficient statistical and graphical means to test 
whether time series data is normally distributed or not. 


Real-World Data 


This section analyzes four historical financial time series, two for technology stocks 
and two for exchange traded funds (ETFs): 

• AAPL.O: Apple Inc. stock price

• MSFT.O: Microsoft Corp. stock price

• SPY: SPDR S&P 500 ETF Trust

• GLD: SPDR Gold Trust

The data management tool of choice is pandas (see Chapter 8). Figure 13-7 shows the
normalized prices over time:


In [27]: import pandas as pd

In [28]: raw = pd.read_csv('../../source/tr_eikon_eod_data.csv',
                           index_col=0, parse_dates=True).dropna()

In [29]: symbols = ['SPY', 'GLD', 'AAPL.O', 'MSFT.O']

In [30]: data = raw[symbols]
         data = data.dropna()


In [31]: data.info() 
<class 'pandas.core.frame.DataFrame'> 
DatetimeIndex: 2138 entries, 2010-01-04 to 2018-06-29 
Data columns (total 4 columns): 
SPY 2138 non-null float64 
GLD 2138 non-null float64 
AAPL.O 2138 non-null float64 
MSFT.O 2138 non-null float64 
dtypes: float64(4) 
Memory usage: 83.5 KB 


In [32]: data.head() 

Out [32]: SPY GLD AAPL.O MSFT.O 
Date 
2010-01-04 113.33 109.80 30.572827 30.950 
2010-01-05 113.63 109.70 30.625684 30.960 
2010-01-06 113.71 111.51 30.138541 30.770 
2010-01-07 114.19 110.82 30.082827 30.452 
2010-01-08 114.57 111.37 30.282827 30.660 


In [33]: (data / data.iloc[0] * 100).plot(figsize=(10, 6))

Figure 13-7. Normalized prices of financial instruments over time

Figure 13-8 shows the log returns of the financial instruments as histograms:


In [34]: log_returns = np.log(data / data.shift(1)) 
log_returns.head() 

Out[34]: SPY GLD  AAPL.O  MSFT.O 
Date 
2010-01-04 NaN NaN NaN NaN 
2010-01-05 0.002644 -0.000911 0.001727 0.000323 
2010-01-06 0.000704 0.016365 -0.016034 -0.006156 
2010-01-07 0.004212 -0.006207 -0.001850 -0.010389 
2010-01-08 0.003322 0.004951 0.006626 0.006807 


In [35]: log_returns.hist(bins=50, figsize=(10, 8)); 

Figure 13-8. Histograms of log returns for financial instruments
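
The log return calculation can be retraced with plain NumPy on the first five SPY prices from the data.head() output. The sketch also illustrates why log returns are convenient: they add up over time to the log return of the whole period.

```python
import numpy as np

# first five SPY closing prices from the data.head() output
prices = np.array([113.33, 113.63, 113.71, 114.19, 114.57])

# equivalent to np.log(data / data.shift(1)) without the leading NaN
log_returns = np.log(prices[1:] / prices[:-1])
print(log_returns.round(6))

# log returns are additive over time
total = np.log(prices[-1] / prices[0])
print(np.isclose(log_returns.sum(), total))
```

The first two values match the SPY column of the log_returns.head() output (0.002644 and 0.000704).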


As a next step, consider the different statistics for the time series data sets. The kurto- 
sis values seem to be especially far from normal for all four data sets: 


In [36]: for sym in symbols:
             print('\nResults for symbol {}'.format(sym))
             print(30 * '-')
             log_data = np.array(log_returns[sym].dropna())
             print_statistics(log_data)  # (1)

Results for symbol SPY
------------------------------
     statistic           value
------------------------------
          size      2137.00000
           min        -0.06734
           max         0.04545
          mean         0.00041
           std         0.00933
          skew        -0.52189
      kurtosis         4.52432

Results for symbol GLD
------------------------------
     statistic           value
------------------------------
          size      2137.00000
           min        -0.09191
           max         0.04795
          mean         0.00004
           std         0.01020
          skew        -0.59934
      kurtosis         5.68423

Results for symbol AAPL.O
------------------------------
     statistic           value
------------------------------
          size      2137.00000
           min        -0.13187
           max         0.08502
          mean         0.00084
           std         0.01591
          skew        -0.23510
      kurtosis         4.78964

Results for symbol MSFT.O
------------------------------
     statistic           value
------------------------------
          size      2137.00000
           min        -0.12103
           max         0.09941
          mean         0.00054
           std         0.01421
          skew        -0.09117
      kurtosis         7.29106

(1) Statistics for time series of financial instruments.

Figure 13-9 shows the QQ plot for the SPY ETF. Obviously, the sample quantile values
do not lie on a straight line, indicating “non-normality.” On the left and right
sides there are many values that lie well below the line and well above the line, respec- 
tively. In other words, the time series data exhibits fat tails. This term refers to a (fre- 
quency) distribution where large negative and positive values are observed more 
often than a normal distribution would imply. The same conclusions can be drawn 
from Figure 13-10, which presents the data for the Microsoft stock. There also seems 
to be evidence for a fat-tailed distribution: 


In [37]: sm.qqplot(log_returns['SPY'].dropna(), line='s') 
plt.title('SPY') 
plt.xlabel('theoretical quantiles') 
plt.ylabel('sample quantiles'); 

In [38]: sm.qqplot(log_returns['MSFT.O'].dropna(), line='s') 
plt.title('MSFT.O') 
plt.xlabel('theoretical quantiles') 
plt.ylabel('sample quantiles'); 

Figure 13-9. Quantile-quantile plot for SPY log returns

Figure 13-10. Quantile-quantile plot for MSFT.O log returns


This finally leads to the statistical normality tests: 


In [39]: for sym in symbols:
             print('\nResults for symbol {}'.format(sym))
             print(32 * '-')
             log_data = np.array(log_returns[sym].dropna())
             normality_tests(log_data)  # (1)

Results for symbol SPY
Skew of data set          -0.522
Skew test p-value          0.000
Kurt of data set           4.524
Kurt test p-value          0.000
Norm test p-value          0.000

Results for symbol GLD
Skew of data set          -0.599
Skew test p-value          0.000
Kurt of data set           5.684
Kurt test p-value          0.000
Norm test p-value          0.000

Results for symbol AAPL.O
Skew of data set          -0.235
Skew test p-value          0.000
Kurt of data set           4.790
Kurt test p-value          0.000
Norm test p-value          0.000

Results for symbol MSFT.O
Skew of data set          -0.091
Skew test p-value          0
Kurt of data set           7.291
Kurt test p-value          0
Norm test p-value          0

(1) Normality test results for the time series of the financial instruments.


The p-values of the different tests are all zero, strongly rejecting the test hypothesis that 
the different sample data sets are normally distributed. This shows that the normal 
assumption for stock market returns and other asset classes—as, for example, 
embodied in the geometric Brownian motion model—cannot be justified in general 
and that one might have to use richer models that are able to generate fat tails (e.g., 
jump diffusion models or models with stochastic volatility). 
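
As a hedged illustration of what these tests detect, the following sketch applies the same three SciPy tests to synthetic data: one sample drawn from a normal distribution and one from a Student's t distribution with three degrees of freedom, which has fat tails. The seed, the scaling factor of 0.01, and the choice of the t distribution are assumptions made for illustration only; none of the numbers come from the data set used above.

```python
import numpy as np
import scipy.stats as scs

rng = np.random.default_rng(100)  # arbitrary seed (assumption)
n = 2137  # same sample size as the log return series above

samples = {
    'normal': rng.standard_normal(n) * 0.01,
    'student-t (df=3)': rng.standard_t(3, n) * 0.01,  # fat-tailed by construction
}

p_values = {}
for name, arr in samples.items():
    # each test returns (statistic, p-value); only the p-value is kept
    p_values[name] = {
        'skew test': scs.skewtest(arr)[1],
        'kurtosis test': scs.kurtosistest(arr)[1],
        'normal test': scs.normaltest(arr)[1],
    }
    print('\n{}: excess kurtosis {:.3f}'.format(name, scs.kurtosis(arr)))
    for test, p in p_values[name].items():
        print('{:15s} p-value {:.3f}'.format(test, p))
```

For the fat-tailed sample, the kurtosis and normality tests should report p-values of essentially zero, mirroring the pattern seen for the four financial instruments.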


Portfolio Optimization 


Modern or mean-variance portfolio theory is a major cornerstone of financial theory. 
Based on this theoretical breakthrough the Nobel Prize in Economics was awarded to 
its inventor, Harry Markowitz, in 1990. Although formulated in the 1950s, it is still a 
theory taught to finance students and applied in practice today (often with some 
minor or major modifications).³ This section illustrates the fundamental principles of
the theory. 


Chapter 5 in the book by Copeland, Weston, and Shastri (2005) provides an intro- 
duction to the formal topics associated with MPT. As pointed out previously, the 
assumption of normally distributed returns is fundamental to the theory: 


By looking only at mean and variance, we are necessarily assuming that no other statis- 
tics are necessary to describe the distribution of end-of-period wealth. Unless investors 
have a special type of utility function (quadratic utility function), it is necessary to 
assume that returns have a normal distribution, which can be completely described by 
mean and variance. 


3 See Markowitz (1952). 

The Data


The analysis and examples that follow use the same financial instruments as before. 
The basic idea of MPT is to make use of diversification to achieve a minimal portfolio 
risk given a target return level or a maximum portfolio return given a certain level of 
risk. One would expect such diversification effects for the right combination of a 
larger number of assets and a certain diversity in the assets. However, to convey the 
basic ideas and to show typical effects, four financial instruments shall suffice. 
Figure 13-11 shows the frequency distribution of the log returns for the financial
instruments:


In [40]: symbols = ['AAPL.O', 'MSFT.O', 'SPY', 'GLD']  # (1)

In [41]: noa = len(symbols)  # (2)

In [42]: data = raw[symbols]

In [43]: rets = np.log(data / data.shift(1))

In [44]: rets.hist(bins=40, figsize=(10, 8));


(1) Four financial instruments for portfolio composition.

(2) Number of financial instruments defined.


The covariance matrix for the financial instruments to be invested in is the central
piece of the portfolio selection process. pandas provides the built-in cov() method to
generate the covariance matrix, to which the same annualization factor of 252 trading
days is applied as to the mean returns:


In [45]: rets.mean() * 252  # (1)
Out[45]: AAPL.O    0.212359
         MSFT.O    0.136648
         SPY       0.102928
         GLD       0.009141
         dtype: float64

In [46]: rets.cov() * 252  # (2)
Out[46]:           AAPL.O    MSFT.O       SPY       GLD
         AAPL.O  0.063773  0.023427  0.021039  0.001513
         MSFT.O  0.023427  0.050917  0.022244 -0.000347
         SPY     0.021039  0.022244  0.021939  0.000062
         GLD     0.001513 -0.000347  0.000062  0.026209


(1) Annualized mean returns.

(2) Annualized covariance matrix.


Figure 13-11. Histograms of log returns of financial instruments


The Basic Theory 


In what follows, it is assumed that an investor is not allowed to set up short positions 
in a financial instrument. Only long positions are allowed, which implies that 100% 
of the investor’s wealth has to be divided among the available instruments in such a 
way that all positions are long (positive) and that the positions add up to 100%. Given 
the four instruments, one could, for example, invest equal amounts into every such 
instrument—i.e., 25% of the available wealth in each. The following code generates 
four uniformly distributed random numbers between 0 and 1 and then normalizes 
the values such that the sum of all values equals 1: 


In [47]: weights = np.random.random(noa)  # (1)
         weights /= np.sum(weights)  # (2)


In [48]: weights
Out[48]: array([0.07650728, 0.06021919, 0.63364218, 0.22963135])


In [49]: weights.sum()
Out[49]: 1.0


(1) Random portfolio weights ...

(2) ... normalized to 1 or 100%.


As verified here, the weights indeed add up to 1; i.e., $\sum_I w_i = 1$, where $I$ is the number
of financial instruments and $w_i \geq 0$ is the weight of financial instrument $i$. Equation
13-1 provides the formula for the expected portfolio return given the weights for the
single instruments. This is an expected portfolio return in the sense that historical
mean performance is assumed to be the best estimator for future (expected) performance.
Here, the $r_i$ are the state-dependent future returns (vector with return values
assumed to be normally distributed) and $\mu_i$ is the expected return for instrument $i$.
Finally, $w^T$ is the transpose of the weights vector and $\mu$ is the vector of the expected
security returns.


Equation 13-1. General formula for expected portfolio return

$$
\begin{aligned}
\mu_p &= \mathbf{E}\left(\sum_I w_i r_i\right) \\
      &= \sum_I w_i \mathbf{E}(r_i) \\
      &= \sum_I w_i \mu_i \\
      &= w^T \mu
\end{aligned}
$$

Translated into Python this boils down to a single line of code including annualization:


In [50]: np.sum(rets.mean() * weights) * 252  # (1)
Out[50]: 0.09179459482057793


(1) Annualized portfolio return given the portfolio weights.


The second object of importance in MPT is the expected portfolio variance. The covariance
between two securities is defined by $\sigma_{ij} = \sigma_{ji} = \mathbf{E}((r_i - \mu_i)(r_j - \mu_j))$. The variance
of a security is the special case of the covariance with itself: $\sigma_i^2 = \mathbf{E}((r_i - \mu_i)^2)$.
Figure 13-12 provides the covariance matrix for a portfolio of securities (assuming an
equal weight of 1 for every security).

Figure 13-12. Portfolio covariance matrix


Equipped with the portfolio covariance matrix, Equation 13-2 then provides the for- 
mula for the expected portfolio variance. 


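Equation 13-2. General formula for expected portfolio variance

$$
\begin{aligned}
\sigma_p^2 &= \mathbf{E}\left((r - \mu)^2\right) \\
           &= \sum_{i \in I} \sum_{j \in I} w_i w_j \sigma_{ij} \\
           &= w^T \Sigma w
\end{aligned}
$$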


In Python, this all again boils down to a single line of code, making heavy use of 
NumPy vectorization capabilities. The np.dot() function gives the dot product of two 
vectors/matrices. The T attribute or transpose() method gives the transpose of a 
vector or matrix. Given the portfolio variance, the (expected) portfolio standard deviation
or volatility $\sigma_p = \sqrt{\sigma_p^2}$ is then only one square root away:


In [51]: np.dot(weights.T, np.dot(rets.cov() * 252, weights))  # (1)
Out[51]: 0.014763288666485574


In [52]: math.sqrt(np.dot(weights.T, np.dot(rets.cov() * 252, weights)))  # (2)
Out[52]: 0.12150427427249452


(1) Annualized portfolio variance given the portfolio weights.

(2) Annualized portfolio volatility given the portfolio weights.
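
The two formulas can also be checked without access to the underlying data file. The following sketch is a minimal, self-contained version that hardcodes the annualized mean returns, the annualized covariance matrix, and the random weights exactly as they are printed in the outputs above (rounded to six to eight digits). The variable names `mu`, `cov`, `port_return`, `port_variance`, and `port_volatility` are illustrative choices, not names used elsewhere in the chapter.

```python
import numpy as np

# Annualized statistics as printed above (order: AAPL.O, MSFT.O, SPY, GLD)
mu = np.array([0.212359, 0.136648, 0.102928, 0.009141])
cov = np.array([
    [0.063773, 0.023427, 0.021039, 0.001513],
    [0.023427, 0.050917, 0.022244, -0.000347],
    [0.021039, 0.022244, 0.021939, 0.000062],
    [0.001513, -0.000347, 0.000062, 0.026209],
])
# Random weights as printed in Out[48]
weights = np.array([0.07650728, 0.06021919, 0.63364218, 0.22963135])

port_return = weights @ mu                # w^T mu (Equation 13-1)
port_variance = weights @ cov @ weights   # w^T Sigma w (Equation 13-2)
port_volatility = np.sqrt(port_variance)  # square root of the variance

print(port_return, port_variance, port_volatility)
```

Up to the rounding of the inputs, the three values should agree with the results for the expected return, variance, and volatility shown above.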


Python and Vectorization 


The MPT example shows how efficient it is with Python to trans- 
late mathematical concepts, like portfolio return or portfolio var- 
iance, into executable, vectorized code (an argument made in 
Chapter 1). 

This mainly completes the tool set for mean-variance portfolio selection. Of paramount
interest to investors is what risk-return profiles are possible for a given set of
financial instruments, and their statistical characteristics. To this end, the following 
implements a Monte Carlo simulation (see Chapter 12) to generate random portfolio 
weight vectors on a larger scale. For every simulated allocation, the code records the 
resulting expected portfolio return and variance. To simplify the code, two functions, 
port_ret() and port_vol(), are defined: 


In [53]: def port_ret(weights):
             return np.sum(rets.mean() * weights) * 252


In [54]: def port_vol(weights):
             return np.sqrt(np.dot(weights.T, np.dot(rets.cov() * 252, weights)))


In [55]: prets = []
         pvols = []
         for p in range(2500):  # (1)
             weights = np.random.random(noa)  # (1)
             weights /= np.sum(weights)  # (1)
             prets.append(port_ret(weights))  # (2)
             pvols.append(port_vol(weights))  # (2)
         prets = np.array(prets)
         pvols = np.array(pvols)


(1) Monte Carlo simulation of portfolio weights.

(2) Collects the resulting statistics in list objects.


Figure 13-13 illustrates the results of the Monte Carlo simulation. In addition, it provides
results for the Sharpe ratio, defined as $SR \equiv \frac{\mu_p - r_f}{\sigma_p}$, i.e., the expected excess
return of the portfolio over the risk-free short rate $r_f$ divided by the expected standard
deviation of the portfolio. For simplicity, $r_f \equiv 0$ is assumed:


In [56]: plt.figure(figsize=(10, 6))
         plt.scatter(pvols, prets, c=prets / pvols,
                     marker='o', cmap='coolwarm')
         plt.xlabel('expected volatility')
         plt.ylabel('expected return')
         plt.colorbar(label='Sharpe ratio');

Figure 13-13. Expected return and volatility for random portfolio weights


It is clear by inspection of Figure 13-13 that not all weight distributions perform well 
when measured in terms of mean and volatility. For example, for a fixed risk level of, 
say, 15%, there are multiple portfolios that all show different returns. As an investor, 
one is generally interested in the maximum return given a fixed risk level or the mini- 
mum risk given a fixed return expectation. This set of portfolios then makes up the 
so-called efficient frontier. This is derived later in this section. 
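
Before turning to numerical optimization, the simulated portfolios themselves already give a rough idea of where the best candidates lie. The sketch below is a vectorized variant of the Monte Carlo simulation that works only with the annualized statistics printed earlier (hardcoded here as `mu` and `cov`, which are illustrative names) and picks the simulated portfolios with the highest Sharpe ratio and the lowest volatility. The seed and the use of `np.einsum()` are assumptions of this sketch; the results are approximations, since random sampling will generally not hit the true optimum.

```python
import numpy as np

# Annualized statistics as printed earlier (order: AAPL.O, MSFT.O, SPY, GLD)
mu = np.array([0.212359, 0.136648, 0.102928, 0.009141])
cov = np.array([
    [0.063773, 0.023427, 0.021039, 0.001513],
    [0.023427, 0.050917, 0.022244, -0.000347],
    [0.021039, 0.022244, 0.021939, 0.000062],
    [0.001513, -0.000347, 0.000062, 0.026209],
])

rng = np.random.default_rng(0)  # arbitrary seed (assumption)
w = rng.random((2500, len(mu)))
w /= w.sum(axis=1, keepdims=True)  # every row of weights sums to 1

prets = w @ mu                                        # expected returns
pvols = np.sqrt(np.einsum('ij,jk,ik->i', w, cov, w))  # w^T Sigma w per row
sharpe = prets / pvols                                # risk-free rate of 0 assumed

i_sr, i_vol = sharpe.argmax(), pvols.argmin()
print('max Sharpe ratio:', sharpe[i_sr].round(3), w[i_sr].round(3))
print('min volatility:  ', pvols[i_vol].round(3), w[i_vol].round(3))
```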


Optimal Portfolios 


The minimize() function from scipy.optimize (imported here as sco) is quite general and
allows for equality constraints, inequality constraints, and numerical bounds for the
parameters.


First, the maximization of the Sharpe ratio. Formally, the negative value of the Sharpe
ratio is minimized to arrive at the maximum value and the optimal portfolio composition.
The constraint is that all parameters (weights) add up to 1. This can be formulated
as follows using the conventions of the minimize() function.⁴ The parameter
values (weights) are also bound to be between 0 and 1. These values are provided to
the minimization function as a tuple of tuples.


4 An alternative to np.sum(x) - 1 would be to write np.sum(x) == 1, taking into account that with Python the 
Boolean True value equals 1 and the False value equals 0. 

The only input that is missing for a call of the optimization function is a starting
parameter list (initial guess for the weights vector). An equal distribution of weights
will do:


In [57]: import scipy.optimize as sco

In [58]: def min_func_sharpe(weights):  # (1)
             return -port_ret(weights) / port_vol(weights)  # (1)

In [59]: cons = ({'type': 'eq', 'fun': lambda x: np.sum(x) - 1})  # (2)

In [60]: bnds = tuple((0, 1) for x in range(noa))  # (3)

In [61]: eweights = np.array(noa * [1. / noa,])  # (4)
         eweights
Out[61]: array([0.25, 0.25, 0.25, 0.25])

In [62]: min_func_sharpe(eweights)
Out[62]: -0.8436203363155397


(1) Function to be minimized.

(2) Equality constraint.

(3) Bounds for the parameters.

(4) Equal weights vector.


Calling the function returns more than just the optimal parameter values. The results 
are stored in an object called opts. The main interest lies in getting the optimal port- 
folio composition. To this end, one can access the results object by providing the key 
of interest; i.e., x in this case: 


In [63]: %%time
         opts = sco.minimize(min_func_sharpe, eweights,
                             method='SLSQP', bounds=bnds,
                             constraints=cons)  # (1)
         CPU times: user 67.6 ms, sys: 1.94 ms, total: 69.6 ms
         Wall time: 75.2 ms

In [64]: opts  # (2)
Out[64]:      fun: -0.8976673894052725
              jac: array([ 8.96826386e-05,  8.30739737e-05, -2.45958567e-04,
                 1.92895532e-05])
          message: 'Optimization terminated successfully.'
             nfev: 36
              nit: 6
             njev: 6
           status: 0
          success: True
                x: array([0.51191354, 0.19126414, 0.25454109, 0.04228123])

In [65]: opts['x'].round(3)  # (3)
Out[65]: array([0.512, 0.191, 0.255, 0.042])

In [66]: port_ret(opts['x']).round(3)  # (4)
Out[66]: 0.161

In [67]: port_vol(opts['x']).round(3)  # (5)
Out[67]: 0.18

In [68]: port_ret(opts['x']) / port_vol(opts['x'])  # (6)
Out[68]: 0.8976673894052725


(1) The optimization (i.e., minimization of function min_func_sharpe()).

(2) The results from the optimization.

(3) The optimal portfolio weights.

(4) The resulting portfolio return.

(5) The resulting portfolio volatility.

(6) The maximum Sharpe ratio.
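
The optimization can be approximately reproduced without the original data file. The following sketch hardcodes the annualized mean returns and covariance matrix as printed earlier in the section and passes the same kind of constraint, bounds, and equal-weights starting vector to `sco.minimize()`. The names `mu`, `cov`, `neg_sharpe`, and `w0` are illustrative; because the inputs are rounded, the optimal weights should be close to, but not necessarily identical with, the ones reported above.

```python
import numpy as np
import scipy.optimize as sco

# Annualized statistics as printed earlier (order: AAPL.O, MSFT.O, SPY, GLD)
mu = np.array([0.212359, 0.136648, 0.102928, 0.009141])
cov = np.array([
    [0.063773, 0.023427, 0.021039, 0.001513],
    [0.023427, 0.050917, 0.022244, -0.000347],
    [0.021039, 0.022244, 0.021939, 0.000062],
    [0.001513, -0.000347, 0.000062, 0.026209],
])

def neg_sharpe(w):
    ''' Negative Sharpe ratio (risk-free rate of 0 assumed). '''
    return -(w @ mu) / np.sqrt(w @ cov @ w)

noa = len(mu)
cons = {'type': 'eq', 'fun': lambda w: np.sum(w) - 1}  # weights add up to 1
bnds = tuple((0, 1) for _ in range(noa))               # long-only positions
w0 = np.full(noa, 1 / noa)                             # equal weights as start

res = sco.minimize(neg_sharpe, w0, method='SLSQP',
                   bounds=bnds, constraints=cons)
print(res['x'].round(3), round(-res['fun'], 4))
```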


Next, the minimization of the variance of the portfolio. This is the same as minimiz- 
ing the volatility: 


In [69]: optv = sco.minimize(port_vol, eweights,
                             method='SLSQP', bounds=bnds,
                             constraints=cons)  # (1)


In [70]: optv
Out[70]:      fun: 0.1094215526341138
              jac: array([0.11098004, 0.10948556, 0.10939826, 0.10944918])
          message: 'Optimization terminated successfully.'
             nfev: 54
              nit: 9
             njev: 9
           status: 0
          success: True
                x: array([1.62630326e-18, 1.06170720e-03, 5.43263079e-01,
                4.55675214e-01])


In [71]: optv['x'].round(3)
Out[71]: array([0.   , 0.001, 0.543, 0.456])


In [72]: port_vol(optv['x']).round(3)
Out[72]: 0.109

In [73]: port_ret(optv['x']).round(3)
Out[73]: 0.06


In [74]: port_ret(optv['x']) / port_vol(optv['x'])
Out[74]: 0.5504173653075624


(1) The minimization of the portfolio volatility.


This time, the portfolio is made up of only three financial instruments. This portfolio 
mix leads to the so-called minimum volatility or minimum variance portfolio. 


Efficient Frontier 


The derivation of all optimal portfolios—i.e., all portfolios with minimum volatility 
for a given target return level (or all portfolios with maximum return for a given risk 
level)—is similar to the previous optimizations. The only difference is that one has to 
iterate over multiple starting conditions. 


The approach taken is to fix a target return level and to derive for each such level 
those portfolio weights that lead to the minimum volatility value. For the optimiza- 
tion, this leads to two conditions: one for the target return level, tret, and one for the 
sum of the portfolio weights as before. The boundary values for each parameter stay 
the same. When iterating over different target return levels (trets), one condition for
the minimization changes. Because the lambda function in the constraints tuple looks
up tret only when it is called, the return constraint automatically uses the current
value of the loop variable:

In [75]: cons = ({'type': 'eq', 'fun': lambda x: port_ret(x) - tret},
                 {'type': 'eq', 'fun': lambda x: np.sum(x) - 1})  # (1)


In [76]: bnds = tuple((0, 1) for x in weights)


In [77]: %%time
         trets = np.linspace(0.05, 0.2, 50)
         tvols = []
         for tret in trets:
             res = sco.minimize(port_vol, eweights, method='SLSQP',
                                bounds=bnds, constraints=cons)  # (2)
             tvols.append(res['fun'])
         tvols = np.array(tvols)
         CPU times: user 2.6 s, sys: 13.1 ms, total: 2.61 s
         Wall time: 2.66 s


(1) The two binding constraints for the efficient frontier.

(2) The minimization of portfolio volatility for different target returns.
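
The loop relies on the constraint function reading the global variable `tret` at call time. A more explicit alternative, sketched below under the assumption that only the printed annualized statistics are available (hardcoded as `mu` and `cov`), rebuilds the constraints in every iteration and binds the current target return through a default argument. The coarser grid of 16 target returns is an arbitrary choice to keep the run time short.

```python
import numpy as np
import scipy.optimize as sco

# Annualized statistics as printed earlier (order: AAPL.O, MSFT.O, SPY, GLD)
mu = np.array([0.212359, 0.136648, 0.102928, 0.009141])
cov = np.array([
    [0.063773, 0.023427, 0.021039, 0.001513],
    [0.023427, 0.050917, 0.022244, -0.000347],
    [0.021039, 0.022244, 0.021939, 0.000062],
    [0.001513, -0.000347, 0.000062, 0.026209],
])

def port_vol(w):
    ''' Annualized portfolio volatility for weights w. '''
    return np.sqrt(w @ cov @ w)

bnds = tuple((0, 1) for _ in mu)
w0 = np.full(len(mu), 1 / len(mu))

trets = np.linspace(0.05, 0.2, 16)  # target returns in steps of 1%
tvols = []
for tret in trets:
    # t=tret fixes the target return for this iteration's constraint
    cons = ({'type': 'eq', 'fun': lambda w, t=tret: w @ mu - t},
            {'type': 'eq', 'fun': lambda w: np.sum(w) - 1})
    res = sco.minimize(port_vol, w0, method='SLSQP',
                       bounds=bnds, constraints=cons)
    tvols.append(res['fun'])
tvols = np.array(tvols)

print(trets[tvols.argmin()].round(3), tvols.min().round(4))
```

The smallest volatility on this grid should lie close to the minimum volatility portfolio derived before, with a target return of about 6%.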


Figure 13-14 shows the optimization results. The thick line indicates the optimal 
portfolios given a certain target return; the dots are, as before, the random portfolios. 
In addition, the figure shows two larger stars, one for the minimum volatility/variance
portfolio (the leftmost portfolio) and one for the portfolio with the maximum
Sharpe ratio:


In [78]: plt.figure(figsize=(10, 6)) 
plt.scatter(pvols, prets, c=prets / pvols, 
marker='.', alpha=0.8, cmap='coolwarm') 
plt.plot(tvols, trets, 'b', lw=4.0) 
plt.plot(port_vol(opts['x']), port_ret(opts['x']), 
'y*', markersize=15.0) 
plt.plot(port_vol(optv['x']), port_ret(optv['x']), 
'r*', markersize=15.0) 
plt.xlabel('expected volatility') 
plt.ylabel('expected return') 
plt.colorbar(label='Sharpe ratio') 

Figure 13-14. Minimum risk portfolios for given return levels (efficient frontier)


The efficient frontier is comprised of all optimal portfolios with a higher return than 
the absolute minimum variance portfolio. These portfolios dominate all other portfo- 
lios in terms of expected returns given a certain risk level. 


Capital Market Line 


In addition to risky financial instruments like stocks or commodities (such as gold), 
there is in general one universal, riskless investment opportunity available: cash or 
cash accounts. In an idealized world, money held in a cash account with a large bank 
can be considered riskless (e.g., through public deposit insurance schemes). The 
downside is that such a riskless investment generally yields only a small return, sometimes
close to zero.


However, taking into account such a riskless asset enhances the efficient investment 
opportunity set for investors considerably. The basic idea is that investors first deter- 
mine an efficient portfolio of risky assets and then add the riskless asset to the mix. 
By adjusting the proportion of the investor’s wealth to be invested in the riskless asset 
it is possible to achieve any risk-return profile that lies on the straight line (in the 
risk-return space) between the riskless asset and the efficient portfolio. 


Which efficient portfolio (out of the many options) is to be taken to invest in opti- 
mally? It is the one portfolio where the tangent line of the efficient frontier goes 
exactly through the risk-return point of the riskless portfolio. For example, consider a 
riskless interest rate of $r_f = 0.01$. The portfolio is to be found on the efficient frontier
for which the tangent goes through the point $(\sigma_f, r_f) = (0, 0.01)$ in risk-return space.


For the calculations that follow, a functional approximation and the first derivative 
for the efficient frontier are used. Cubic splines interpolation provides such a differ- 
entiable functional approximation (see Chapter 11). For the spline interpolation, only 
those portfolios from the efficient frontier are used. Via this numerical approach it is 
possible to define a continuously differentiable function f(x) for the efficient frontier 
and the respective first derivative function df (x): 


In [79]: import scipy.interpolate as sci


In [80]: ind = np.argmin(tvols)  # (1)
         evols = tvols[ind:]  # (2)
         erets = trets[ind:]  # (2)


In [81]: tck = sci.splrep(evols, erets)  # (3)


In [82]: def f(x):
             ''' Efficient frontier function (splines approximation). '''
             return sci.splev(x, tck, der=0)
         def df(x):
             ''' First derivative of efficient frontier function. '''
             return sci.splev(x, tck, der=1)


(1) Index position of minimum volatility portfolio.

(2) Relevant portfolio volatility and return values.

(3) Cubic splines interpolation on these values.
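
To see that the spline representation delivers both function values and first derivatives, the following sketch applies `sci.splrep()` and `sci.splev()` to a function with a known derivative. The square root function, the grid, and the evaluation point are assumptions chosen purely for illustration; they are not related to the efficient frontier data. Note that the x values passed to `splrep()` are expected to be in increasing order, which is one reason why only the part of the frontier beyond the minimum volatility portfolio is used above.

```python
import numpy as np
import scipy.interpolate as sci

x = np.linspace(0.1, 0.3, 20)  # increasing x values
y = np.sqrt(x)                 # illustrative function with known derivative

tck = sci.splrep(x, y)  # cubic spline representation (degree k=3 by default)

x0 = 0.2
approx_value = float(sci.splev(x0, tck, der=0))  # spline value at x0
approx_slope = float(sci.splev(x0, tck, der=1))  # spline first derivative at x0
exact_slope = 0.5 / np.sqrt(x0)                  # analytical derivative of sqrt(x)

print(approx_value, approx_slope, exact_slope)
```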


What is now to be derived is a linear function $t(x) = a + b \cdot x$ representing the line
that passes through the riskless asset in risk-return space and that is tangent to the
efficient frontier. Equation 13-3 describes all three conditions that the function t(x)
needs to satisfy.


Equation 13-3. Mathematical conditions for capital market line

$$
\begin{aligned}
t(x) &= a + b \cdot x \\
t(0) = r_f \quad &\Leftrightarrow \quad a = r_f \\
t(x) = f(x) \quad &\Leftrightarrow \quad a + b \cdot x = f(x) \\
t'(x) = f'(x) \quad &\Leftrightarrow \quad b = f'(x)
\end{aligned}
$$


Since there is no closed formula for the efficient frontier or the first derivative of it, 
one has to solve the system of equations in Equation 13-3 numerically. To this end, 
define a Python function that returns the values of all three equations given the 
parameter set p = (a, b, x). 


The function sco.fsolve() from scipy.optimize is capable of solving such a system
of equations. In addition to the function equations(), an initial parameterization is 
provided. Note that success or failure of the optimization might depend on the initial 
parameterization, which therefore has to be chosen carefully—generally by a combi- 
nation of educated guesses with trial and error: 


In [83]: def equations(p, rf=0.01):
             eq1 = rf - p[0]  # (1)
             eq2 = rf + p[1] * p[2] - f(p[2])  # (1)
             eq3 = p[1] - df(p[2])  # (1)
             return eq1, eq2, eq3


In [84]: opt = sco.fsolve(equations, [0.01, 0.5, 0.15])  # (2)


In [85]: opt  # (3)
Out[85]: array([0.01      , 0.84470952, 0.19525391])


In [86]: np.round(equations(opt), 6)  # (4)
Out[86]: array([ 0.,  0., -0.])


(1) The equations describing the capital market line (CML).

(2) Solving these equations for given initial values.

(3) The optimal parameter values.

(4) The equation values are all zero.
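
The mechanics of the tangency conditions can be checked on a frontier for which a closed-form answer exists. The sketch below replaces the spline-based frontier with a hypothetical concave parabola (an assumption made only for illustration, not a model of the data) and solves the same three equations with `sco.fsolve()`. For this parabola, the tangency point follows analytically from the conditions in Equation 13-3, so the numerical result can be compared against it.

```python
import numpy as np
import scipy.optimize as sco

rf = 0.01  # riskless rate, as in the text

def f(x):
    ''' Hypothetical concave frontier (illustration only). '''
    return 0.2 - 4 * (x - 0.3) ** 2

def df(x):
    ''' First derivative of the hypothetical frontier. '''
    return -8 * (x - 0.3)

def equations(p):
    a, b, x = p
    # t(0) = rf, t(x) = f(x), t'(x) = f'(x)
    return a - rf, a + b * x - f(x), b - df(x)

a, b, x = sco.fsolve(equations, [0.01, 0.5, 0.15])

# closed form for this parabola: x^2 = 0.3^2 - (0.2 - rf) / 4
x_exact = np.sqrt(0.3 ** 2 - (0.2 - rf) / 4)
print(a, b, x, x_exact)
```

At the solution, the slope b equals (f(x) - rf) / x, i.e., the tangency point is the point on the frontier with the highest Sharpe ratio relative to the riskless rate.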


Figure 13-15 presents the results graphically; the star represents the optimal portfolio 
from the efficient frontier for which the tangent line passes through the riskless asset 
point $(0, r_f = 0.01)$:

In [87]: plt.figure(figsize=(10, 6))
         plt.scatter(pvols, prets, c=(prets - 0.01) / pvols,
                     marker='.', cmap='coolwarm')
         plt.plot(evols, erets, 'b', lw=4.0)
         cx = np.linspace(0.0, 0.3)
         plt.plot(cx, opt[0] + opt[1] * cx, 'r', lw=1.5)
         plt.plot(opt[2], f(opt[2]), 'y*', markersize=15.0)
         plt.grid(True)
         plt.axhline(0, color='k', ls='--', lw=2.0)
         plt.axvline(0, color='k', ls='--', lw=2.0)
         plt.xlabel('expected volatility')
         plt.ylabel('expected return')
         plt.colorbar(label='Sharpe ratio')

Figure 13-15. Capital market line and tangent portfolio (star) for risk-free rate of 1%


The portfolio weights of the optimal (tangent) portfolio are as follows. Only three of 
the four assets are in the mix: 


In [88]: cons = ({'type': 'eq', 'fun': lambda x: port_ret(x) - f(opt[2])},
                 {'type': 'eq', 'fun': lambda x: np.sum(x) - 1})  # (1)
         res = sco.minimize(port_vol, eweights, method='SLSQP',
                            bounds=bnds, constraints=cons)


In [89]: res['x'].round(3)  # (2)
Out[89]: array([0.59 , 0.221, 0.189, 0.   ])


In [90]: port_ret(res['x'])
Out[90]: 0.1749328414905194

In [91]: port_vol(res['x'])
Out[91]: 0.19525371793918325


In [92]: port_ret(res['x']) / port_vol(res['x'])
Out[92]: 0.8959257899765407


(1) Binding constraints for the tangent portfolio (gold star in Figure 13-15).

(2) The portfolio weights for this particular portfolio.


Bayesian Statistics 


Bayesian statistics is nowadays widely popular in empirical finance. This chapter
certainly cannot lay the foundations for all concepts of the field. The reader should
therefore consult, if needed, a textbook like the one by Geweke (2005) for a general
introduction or Rachev (2008) for one that is financially motivated.


Bayes’ Formula 


The most common interpretation of Bayes’ formula in finance is the diachronic inter- 
pretation. This mainly states that over time one learns new information about certain 
variables or parameters of interest, like the mean return of a time series. Equation 
13-4 states the theorem formally. 


Equation 13-4. Bayes’ formula

$$
p(H \mid D) = \frac{p(H) \cdot p(D \mid H)}{p(D)}
$$

Here, H stands for an event, the hypothesis, and D represents the data an experiment
or the real world might present.⁵ On the basis of these fundamental notions, one has:


p(H)
    The prior probability

p(D)
    The probability for the data under any hypothesis, called the normalizing constant

p(D | H)
    The likelihood (i.e., the probability) of the data under hypothesis H

p(H | D)
    The posterior probability; i.e., after one has seen the data


5 For a Python-based introduction into these and other fundamental concepts of Bayesian statistics, refer to
Downey (2013).


Consider a simple example. There are two boxes, $B_1$ and $B_2$. Box $B_1$ contains 30 black
balls and 60 red balls, while box $B_2$ contains 60 black balls and 30 red balls. A ball is
randomly drawn from one of the two boxes. Assume the ball is black. What are the
probabilities for the hypotheses “$H_1$: Ball is from box $B_1$” and “$H_2$: Ball is from box
$B_2$,” respectively?


Before the random draw of the ball, both hypotheses are equally likely. After it is
clear that the ball is black, one has to update the probability for both hypotheses
according to Bayes’ formula. Consider hypothesis $H_1$:


• Prior: $p(H_1) = \frac{1}{2}$
• Normalizing constant: $p(D) = \frac{1}{2} \cdot \frac{1}{3} + \frac{1}{2} \cdot \frac{2}{3} = \frac{1}{2}$
• Likelihood: $p(D \mid H_1) = \frac{1}{3}$


This gives the updated probability for $H_1$ of $p(H_1 \mid D) = \frac{\frac{1}{2} \cdot \frac{1}{3}}{\frac{1}{2}} = \frac{1}{3}$.


This result also makes sense intuitively. The probability of drawing a black ball from
box $B_2$ is twice as high as that of the same event happening with box $B_1$. Therefore,
having drawn a black ball, the hypothesis $H_2$ has with $p(H_2 \mid D) = \frac{2}{3}$ an updated
probability two times as high as the updated probability for hypothesis $H_1$.
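
The update for both hypotheses can be verified with a few lines of Python. The following sketch uses exact rational arithmetic from the standard library; the dictionary names `prior`, `likelihood`, and `posterior` are illustrative choices.

```python
from fractions import Fraction

# both boxes are equally likely before the draw
prior = {'H1': Fraction(1, 2), 'H2': Fraction(1, 2)}
# probability of drawing a black ball from box B1 and box B2, respectively
likelihood = {'H1': Fraction(30, 90), 'H2': Fraction(60, 90)}

# normalizing constant: probability of a black ball under any hypothesis
p_data = sum(prior[h] * likelihood[h] for h in prior)

# Bayes' formula applied to each hypothesis
posterior = {h: prior[h] * likelihood[h] / p_data for h in prior}

print(p_data, posterior['H1'], posterior['H2'])  # 1/2 1/3 2/3
```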


Bayesian Regression 


With PyMC3 the Python ecosystem provides a comprehensive package to technically 
implement Bayesian statistics and probabilistic programming. 


Consider the following example based on noisy data around a straight line.⁶ First, a
linear ordinary least-squares regression (see Chapter 11) is implemented on the data 
set, the result of which is visualized in Figure 13-16: 


In [1]: import numpy as np 
import pandas as pd 
import datetime as dt 
from pylab import mpl, plt 


In [2]: plt.style.use('seaborn') 
mpl.rcParams['font.family'] = 'serif' 
np.random.seed(1000) 

%matplotlib inline 


6 Examples originally provided by Thomas Wiecki, one of the main authors of the PyMC3 package. 

In [3]: x = np.linspace(0, 10, 500)
        y = 4 + 2 * x + np.random.standard_normal(len(x)) * 2


In [4]: reg = np.polyfit(x, y, 1) 


In [5]: reg 
Out[5]: array([2.03384161, 3.77649234]) 


In [6]: plt.figure(figsize=(10, 6)) 
plt.scatter(x, y, c=y, marker='v', cmap='coolwarm') 
plt.plot(x, reg[1] + reg[0] * x, lw=2.0)
plt.colorbar()
plt.xlabel('x')
plt.ylabel('y')

Figure 13-16. Sample data points and regression line


The results of the OLS regression approach are fixed values for the two parameters of 
the regression line (intercept and slope). Note that the highest-order monomial factor 
(in this case, the slope of the regression line) is at index level 0 and that the intercept 
is at index level 1. The original parameters 2 and 4 are not perfectly recovered, but 
this of course is due to the noise included in the data. 


Second, a Bayesian regression making use of the PyMC3 package. Here, it is assumed
that the parameters are distributed in a certain way. For example, consider the equation
describing the regression line $f(x) = \alpha + \beta \cdot x$. Assume now the following priors:

• $\alpha$ is normally distributed with mean 0 and a standard deviation of 20.

• $\beta$ is normally distributed with mean 0 and a standard deviation of 10.


For the likelihood, assume a normal distribution with a mean of $f(x)$ and a uniformly
distributed standard deviation of between 0 and 10.


A major element of Bayesian regression is Markov chain Monte Carlo (MCMC) sampling.⁷
In principle, this is the same as drawing balls multiple times from boxes, as in
the simple example in the previous section—just in a more systematic, automated 
way. 
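
To make the idea of MCMC sampling more tangible, the following sketch implements a plain random-walk Metropolis sampler in NumPy for the model described above. This is deliberately not the NUTS algorithm used by PyMC3; it is a much simpler technique swapped in for illustration. The seed, the proposal step sizes, the number of iterations, and the burn-in length are all assumptions of this sketch, and the data are regenerated with a different random number generator than in the listing above, so the estimates will differ slightly from those reported in this section.

```python
import numpy as np

rng = np.random.default_rng(1000)  # arbitrary seed (assumption)
x = np.linspace(0, 10, 500)
y = 4 + 2 * x + rng.standard_normal(len(x)) * 2

def log_posterior(alpha, beta, sigma):
    ''' Unnormalized log posterior for the priors and likelihood above. '''
    if not 0 < sigma < 10:  # uniform prior on sigma between 0 and 10
        return -np.inf
    log_prior = -0.5 * (alpha / 20) ** 2 - 0.5 * (beta / 10) ** 2
    resid = y - (alpha + beta * x)
    log_like = -len(y) * np.log(sigma) - 0.5 * np.sum(resid ** 2) / sigma ** 2
    return log_prior + log_like

beta0, alpha0 = np.polyfit(x, y, 1)      # OLS estimates as starting point
theta = np.array([alpha0, beta0, 2.0])   # (alpha, beta, sigma)
step = np.array([0.1, 0.02, 0.05])       # proposal scales (hand-picked)
logp = log_posterior(*theta)

samples = []
for _ in range(20000):
    proposal = theta + step * rng.standard_normal(3)
    logp_new = log_posterior(*proposal)
    # accept with probability min(1, posterior ratio)
    if np.log(rng.uniform()) < logp_new - logp:
        theta, logp = proposal, logp_new
    samples.append(theta)

samples = np.array(samples[5000:])  # discard burn-in draws
post_mean = samples.mean(axis=0)
post_std = samples.std(axis=0)
print(post_mean.round(3), post_std.round(3))
```

The posterior means should lie in the neighborhood of the parameters used to generate the data, and the posterior standard deviations give a first impression of the uncertainty in the estimates.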


For the technical sampling, there are three different functions to call: 


• find_MAP() finds the starting point for the sampling algorithm by deriving the
  local maximum a posteriori point.


• NUTS() implements the so-called “efficient No-U-Turn Sampler with dual averaging”
  (NUTS) algorithm for MCMC sampling given the assumed priors.


• sample() draws a number of samples given the starting value from find_MAP()
  and the optimal step size from the NUTS algorithm.


All this is to be wrapped into a PyMC3 Model object and executed within a with 
statement: 


In [8]: import pymc3 as pm


In [9]: %%time
        with pm.Model() as model:
            # model
            alpha = pm.Normal('alpha', mu=0, sd=20)  # (1)
            beta = pm.Normal('beta', mu=0, sd=10)  # (1)
            sigma = pm.Uniform('sigma', lower=0, upper=10)  # (1)
            y_est = alpha + beta * x  # (2)
            likelihood = pm.Normal('y', mu=y_est, sd=sigma,
                                   observed=y)  # (3)

            # inference
            start = pm.find_MAP()  # (4)
            step = pm.NUTS()  # (5)
            trace = pm.sample(100, tune=1000, start=start,
                              progressbar=True)  # (6)
        logp = -1,067.8, ||grad|| = 60.354: 100%|██████████| 28/28 [00:00<00:00,
         474.70it/s]
        Only 100 samples in chain.
        Auto-assigning NUTS sampler...
        Initializing NUTS using jitter+adapt_diag...
        Multiprocess sampling (2 chains in 2 jobs)
        NUTS: [sigma, beta, alpha]
        Sampling 2 chains: 100%|██████████| 2200/2200 [00:03<00:00,
         690.96draws/s]
        CPU times: user 6.2 s, sys: 1.72 s, total: 7.92 s
        Wall time: 1min 28s


In [10]: pm.summary(trace)  # (7)
Out[10]:            mean        sd  mc_error   hpd_2.5  hpd_97.5       n_eff      Rhat
         alpha  3.764027  0.174796  0.013177  3.431739  4.070091  152.446951  0.996281
         beta   2.036318  0.030519  0.002230  1.986874  2.094008  106.505590  0.999155
         sigma  2.010398  0.058663  0.004517  1.904395  2.138187  188.643293  0.998547

In [11]: trace[0]  # (8)
Out[11]: {'alpha': 3.9303300798212444,
          'beta': 2.0020264758995463,
          'sigma_interval__': -1.3519315719461853,
          'sigma': 2.0555476283253156}


(1) Defines the priors.

(2) Specifies the linear regression.

(3) Defines the likelihood.

(4) Finds the starting value by optimization.

(5) Instantiates the MCMC algorithm.

(6) Draws posterior samples using NUTS.

(7) Shows summary statistics from samplings.

(8) Estimates from the first sample.


7 For example, the Monte Carlo algorithms used throughout the book and analyzed in detail in Chapter 12 all
generate so-called Markov chains, since the immediate next step/value only depends on the current state of
the process and not on any other historic state or value.


The three estimates shown are rather close to the original values (4, 2, 2). However, 
the whole procedure yields more estimates. They are best illustrated with the help of 
a trace plot, as in Figure 13-17, i.e., a plot showing the resulting posterior distribution 
for the different parameters as well as all single estimates per sample. The posterior 
distribution gives an intuitive sense about the uncertainty in the estimates: 


In [12]: pm.traceplot(trace, lines={'alpha': 4, 'beta': 2, 'sigma': 2}); 


Figure 13-17. Posterior distributions and trace plots 


Taking only the alpha and beta values from the regression, one can draw all result- 
ing regression lines as shown in Figure 13-18: 


In [13]: plt.figure(figsize=(10, 6))
         plt.scatter(x, y, c=y, marker='v', cmap='coolwarm')
         plt.colorbar()
         plt.xlabel('x')
         plt.ylabel('y')
         for i in range(len(trace)):
             plt.plot(x, trace['alpha'][i] + trace['beta'][i] * x)  # (1)


(1) Plots single regression lines.


Figure 13-18. Regression lines based on the different estimates 
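

To make the idea of MCMC sampling more tangible, the following is a minimal sketch of a random-walk Metropolis sampler for the same kind of linear regression. It is not the NUTS algorithm that PyMC3 uses; it is a much simpler sampler, swapped in here only because it fits in a few lines of plain Python. The data set, the proposal step sizes in scales, and the burn-in length are assumptions made for illustration, and the least-squares starting values merely play the role of find_MAP().

```python
import math
import random

random.seed(1000)

# Synthetic data in the spirit of the example: y = 4 + 2 * x + noise (sd = 2).
n = 200
x = [10 * i / (n - 1) for i in range(n)]
y = [4 + 2 * xi + random.gauss(0, 2) for xi in x]


def log_posterior(alpha, beta, sigma):
    """Unnormalized log posterior under the priors used in the text."""
    if not 0 < sigma < 10:  # Uniform(0, 10) prior on sigma
        return -math.inf
    # Normal(0, 20) prior on alpha, Normal(0, 10) prior on beta
    log_prior = -0.5 * (alpha / 20) ** 2 - 0.5 * (beta / 10) ** 2
    sse = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    log_likelihood = -n * math.log(sigma) - 0.5 * sse / sigma ** 2
    return log_prior + log_likelihood


# Least-squares starting values (playing the role of find_MAP()).
x_bar, y_bar = sum(x) / n, sum(y) / n
beta0 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
alpha0 = y_bar - beta0 * x_bar
sigma0 = math.sqrt(sum((yi - alpha0 - beta0 * xi) ** 2
                       for xi, yi in zip(x, y)) / n)

state = [alpha0, beta0, sigma0]
current = log_posterior(*state)
scales = [0.15, 0.025, 0.1]  # hypothetical proposal step sizes
samples = []
for _ in range(4000):
    proposal = [s + sc * random.gauss(0, 1) for s, sc in zip(state, scales)]
    candidate = log_posterior(*proposal)
    # Metropolis rule: accept with probability min(1, posterior ratio).
    if random.random() < math.exp(min(0.0, candidate - current)):
        state, current = proposal, candidate
    samples.append(tuple(state))

kept = samples[1000:]  # discard the first draws as burn-in
post_mean = [sum(s[i] for s in kept) / len(kept) for i in range(3)]
print(post_mean)  # posterior means for alpha, beta, sigma
```

Every element of kept is one joint draw of (alpha, beta, sigma), comparable to a single entry of the trace object. Each draw depends only on the previous one, which is the Markov chain property mentioned in the footnote.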


Two Financial Instruments 


Having introduced Bayesian regression with PyMC3 based on dummy data, the move 
to real financial data is straightforward. The example uses financial time series data 
for the two exchange traded funds (ETFs) GLD and GDX (see Figure 13-19): 


In [14]: raw = pd.read_csv('../../source/tr_eikon_eod_data.csv',
                           index_col=0, parse_dates=True)


In [15]: data = raw[['GDX', 'GLD']].dropna()


In [16]: data = data / data.iloc[0]  # (1)


In [17]: data.info()
         <class 'pandas.core.frame.DataFrame'>
         DatetimeIndex: 2138 entries, 2010-01-04 to 2018-06-29
         Data columns (total 2 columns):
         GDX    2138 non-null float64
         GLD    2138 non-null float64
         dtypes: float64(2)
         memory usage: 50.1 KB


In [18]: data.iloc[-1] / data.iloc[0] - 1  # (2)
Out[18]: GDX   -0.532383
         GLD    0.080601
         dtype: float64


In [19]: data.corr()  # (3)
Out[19]:          GDX      GLD
         GDX  1.00000  0.71539
         GLD  0.71539  1.00000


In [20]: data.plot(figsize=(10, 6));


(1) Normalizes the data to a starting value of 1.

(2) Calculates the relative performances.

(3) Calculates the correlation between the two instruments.


Figure 13-19. Normalized prices for GLD and GDX over time 


In what follows, the dates of the single data points are visualized in scatter plots. To 
this end, the DatetimeIndex object of the DataFrame is transformed to matplotlib 
dates. Figure 13-20 shows a scatter plot of the time series data, plotting the GLD values 
against the GDX values and illustrating the dates of each data pair by different 
colorings:⁸ 


In [21]: data.index[:3] 
Out[21]: DatetimeIndex(['2010-01-04', '2010-01-05', '2010-01-06'], 
dtype='datetime64[ns]', name='Date', freq=None) 


8 Note all visualizations here are based on normalized price data and not, as might be better in real-world 
applications, on return data, for instance. 
In [22]: mpl_dates = mpl.dates.date2num(data.index.to_pydatetime())  # (1)
         mpl_dates[:3]
Out[22]: array([733776., 733777., 733778.])


In [23]: plt.figure(figsize=(10, 6))
         plt.scatter(data['GDX'], data['GLD'], c=mpl_dates,
                     marker='o', cmap='coolwarm')
         plt.xlabel('GDX')
         plt.ylabel('GLD')
         plt.colorbar(ticks=mpl.dates.DayLocator(interval=250),
                      format=mpl.dates.DateFormatter('%d %b %y'));  # (2)


(1) Converts the DatetimeIndex object to matplotlib dates.

(2) Customizes the color bar for the dates.


Figure 13-20. Scatter plot of GLD prices against GDX prices 
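

The numbers in Out[22] can be checked with the standard library alone. The sketch below assumes that the matplotlib version used for the book counts days in the same way as Python's proleptic Gregorian ordinal (day 1 is 0001-01-01). This appears to hold for matplotlib releases before 3.3, while newer releases seem to default to an epoch of 1970-01-01, so the values may differ on a current installation.

```python
import datetime as dt

first_day = dt.date(2010, 1, 4)  # first index value of the data set

# Python's ordinal: number of days since 0001-01-01, which counts as day 1.
ordinal = first_day.toordinal()
print(ordinal)

# The inverse mapping, used later in the section to rebuild dates.
roundtrip = dt.datetime.fromordinal(ordinal)
print(roundtrip)

# Under an epoch of 1970-01-01 the same date maps to a much smaller number.
days_since_1970 = (first_day - dt.date(1970, 1, 1)).days
print(days_since_1970)
```

If mpl_dates holds values of the second kind, a conversion with dt.datetime.fromordinal() does not return the original dates. In that case mpl.dates.num2date() is the safer way back.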


The following code implements a Bayesian regression on the basis of these two time 
series. The parameterizations are essentially the same as in the previous example with 
dummy data. Figure 13-21 shows the results from the MCMC sampling procedure 
given the assumptions about the prior probability distributions for the three 
parameters: 


In [24]: with pm.Model() as model:
             alpha = pm.Normal('alpha', mu=0, sd=20)
             beta = pm.Normal('beta', mu=0, sd=20)
             sigma = pm.Uniform('sigma', lower=0, upper=50)

             y_est = alpha + beta * data['GDX'].values

             likelihood = pm.Normal('GLD', mu=y_est, sd=sigma,
                                    observed=data['GLD'].values)

             start = pm.find_MAP()
             step = pm.NUTS()
             trace = pm.sample(250, tune=2000, start=start,
                               progressbar=True)
         logp = 1,493.7, ||grad|| = 188.29: 100%|██████████| 27/27 [00:00<00:00,
          1609.34it/s]
         Only 250 samples in chain.
         Auto-assigning NUTS sampler...
         Initializing NUTS using jitter+adapt_diag...
         Multiprocess sampling (2 chains in 2 jobs)
         NUTS: [sigma, beta, alpha]
         Sampling 2 chains: 100%|██████████| 4500/4500 [00:09<00:00,
          465.07draws/s]
         The estimated number of effective samples is smaller than 200 for some
          parameters.


In [25]: pm.summary(trace)
Out[25]:
            mean        sd  mc_error   hpd_2.5  hpd_97.5       n_eff      Rhat
alpha   0.913335  0.005983  0.000356  0.901586  0.924714  184.264900  1.001855
beta    0.385394  0.007746  0.000461  0.369154  0.398291  215.477738  1.001570
sigma   0.119484  0.001964  0.000098  0.115305  0.123315  312.260213  1.005246


In [26]: fig = pm.traceplot(trace)
Figure 13-21. Posterior distributions and trace plots for GDX and GLD data 


Figure 13-22 adds all the resulting regression lines to the scatter plot from before. 
However, all the regression lines are pretty close to each other: 


In [27]: plt.figure(figsize=(10, 6))
         plt.scatter(data['GDX'], data['GLD'], c=mpl_dates,
                     marker='o', cmap='coolwarm')
         plt.xlabel('GDX')
         plt.ylabel('GLD')
         for i in range(len(trace)):
             plt.plot(data['GDX'],
                      trace['alpha'][i] + trace['beta'][i] * data['GDX'])
         plt.colorbar(ticks=mpl.dates.DayLocator(interval=250),
                      format=mpl.dates.DateFormatter('%d %b %y'));


Figure 13-22. Multiple Bayesian regression lines through GDX and GLD data 


The figure reveals a major drawback of the regression approach used: the approach 
does not take into account evolutions over time. That is, the most recent data is 
treated the same way as the oldest data. 


Updating Estimates over Time 


As pointed out before, the Bayesian approach in finance is generally most useful 
when seen as diachronic—i.e., in the sense that new data revealed over time allows 
for better regressions and estimates through updating or learning. 


To incorporate this concept in the current example, assume that the regression 
parameters are not only random and distributed in some fashion, but that they follow 
some kind of random walk over time. It is the same generalization used when making 
the transition in financial theory from random variables to stochastic processes 
(which are essentially ordered sequences of random variables). 


To this end, define a new PyMC3 model, this time specifying parameter values as ran- 
dom walks. After having specified the distributions of the random walk parameters, 
one proceeds with specifying the random walks for alpha and beta. To make the 
whole procedure more efficient, 50 data points at a time share common coefficients: 


In [28]: from pymc3.distributions.timeseries import GaussianRandomWalk


In [29]: subsample_alpha = 50
         subsample_beta = 50


In [30]: model_randomwalk = pm.Model()
         with model_randomwalk:
             sigma_alpha = pm.Exponential('sig_alpha', 1. / .02, testval=.1)  # (1)
             sigma_beta = pm.Exponential('sig_beta', 1. / .02, testval=.1)  # (1)
             alpha = GaussianRandomWalk('alpha', sigma_alpha ** -2,
                                 shape=int(len(data) / subsample_alpha))  # (2)
             beta = GaussianRandomWalk('beta', sigma_beta ** -2,
                                 shape=int(len(data) / subsample_beta))  # (2)
             alpha_r = np.repeat(alpha, subsample_alpha)  # (3)
             beta_r = np.repeat(beta, subsample_beta)  # (3)
             regression = alpha_r + beta_r * data['GDX'].values[:2100]  # (4)
             sd = pm.Uniform('sd', 0, 20)  # (5)
             likelihood = pm.Normal('GLD', mu=regression, sd=sd,
                                    observed=data['GLD'].values[:2100])  # (6)


(1) Defines priors for the random walk parameters.

(2) Models for the random walks.

(3) Brings the parameter vectors to interval length.

(4) Defines the regression model.

(5) The prior for the standard deviation.

(6) Defines the likelihood with mu from regression results. 
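

The bookkeeping behind the shape arguments, the np.repeat() calls, and the [:2100] slices can be sketched in plain Python. The 2,138 rows come from the data.info() output above. The toy random walk with a step size of 0.02 is only an assumption for illustration and is not the PyMC3 model itself.

```python
import random

random.seed(0)

n_obs = 2138      # rows in the GDX/GLD data set
subsample = 50    # observations sharing one coefficient

n_intervals = int(n_obs / subsample)  # one coefficient per interval
n_used = n_intervals * subsample      # observations that fit full intervals
print(n_intervals, n_used, n_obs - n_used)

# A toy Gaussian random walk: one coefficient value per interval.
alpha = []
level = 0.0
for _ in range(n_intervals):
    level += random.gauss(0, 0.02)
    alpha.append(level)

# Plain-Python equivalent of np.repeat(alpha, subsample): every interval
# value is repeated once for each of its 50 observations.
alpha_r = [a for a in alpha for _ in range(subsample)]
print(len(alpha_r))
```

The repeated vector has exactly as many elements as the truncated data vectors, which is why the last 38 observations are left out of the regression.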


All these definitions are a bit more involved than before due to the use of random 
walks instead of a single random variable. However, the inference steps with the 
MCMC sampling remain essentially the same. Note, though, that the computational 
burden increases substantially since the algorithm has to estimate parameters per 
random walk sample, i.e., 2,100 / 50 = 42 parameter combinations in this case 
(instead of 1, as before): 


In [31]: %%time
         import scipy.optimize as sco
         with model_randomwalk:
             start = pm.find_MAP(vars=[alpha, beta],
                                 fmin=sco.fmin_l_bfgs_b)
             step = pm.NUTS(scaling=start)
             trace_rw = pm.sample(250, tune=1000, start=start,
                                  progressbar=True)
         logp = -6,657: 2%|          | 82/5000 [00:00<00:08, 550.29it/s]
         Only 250 samples in chain.
         Auto-assigning NUTS sampler...
         Initializing NUTS using jitter+adapt_diag...
         Multiprocess sampling (2 chains in 2 jobs)
         NUTS: [sd, beta, alpha, sig_beta, sig_alpha]
         Sampling 2 chains: 100%|██████████| 2500/2500 [02:48<00:00, 8.59draws/s]

         CPU times: user 27.5 s, sys: 3.68 s, total: 31.2 s
         Wall time: 5min 3s


In [32]: pm.summary(trace_rw).head()  # (1)
Out[32]:
               mean        sd  mc_error   hpd_2.5  hpd_97.5        n_eff  \
alpha__0   0.673846  0.040224  0.001376  0.592655  0.753034  1004.616544
alpha__1   0.424819  0.041257  0.001618  0.348102  0.509757   804.760648
alpha__2   0.456817  0.057200  0.002011  0.321125  0.553173   800.225916
alpha__3   0.268148  0.044879  0.001725  0.182744  0.352197   724.967532
alpha__4   0.651465  0.057472  0.002197  0.544076  0.761216   978.073246

               Rhat
alpha__0   0.998637
alpha__1   0.999540
alpha__2   0.998075
alpha__3   0.998995
alpha__4   0.998060


(1) The summary statistics per interval (first five and alpha only). 


Figure 13-23 illustrates the evolution of the regression parameters alpha and beta 
over time by plotting a subset of the estimates: 


In [33]: sh = np.shape(trace_rw['alpha'])  # (1)
         sh
Out[33]: (500, 42)


In [34]: part_dates = np.linspace(min(mpl_dates),
                                  max(mpl_dates), sh[1])  # (2)


In [35]: index = [dt.datetime.fromordinal(int(date)) for
                  date in part_dates]


In [36]: alpha = {'alpha_%i' % i: v for i, v in
                  enumerate(trace_rw['alpha']) if i < 20}  # (3)


In [37]: beta = {'beta_%i' % i: v for i, v in
                 enumerate(trace_rw['beta']) if i < 20}  # (3)


In [38]: df_alpha = pd.DataFrame(alpha, index=index)  # (3)


In [39]: df_beta = pd.DataFrame(beta, index=index)  # (3)


In [40]: ax = df_alpha.plot(color='b', style='-.', legend=False,
                            lw=0.7, figsize=(10, 6))
         df_beta.plot(color='r', style='-.', legend=False,
                      lw=0.7, ax=ax)
         plt.ylabel('alpha/beta');


(1) Shape of the object with parameter estimates.

(2) Creates a list of dates to match the number of intervals.

(3) Collects the relevant parameter time series in two DataFrame objects.


Figure 13-23. Selected parameter estimates over time 


Absolute Price Data Versus Relative Return Data 


The analyses in this section are based on normalized price data. 
This is for illustration purposes only, because the respective graph- 
ical results are easier to understand and interpret (they are visually 
“more appealing”). For real-world financial applications one would 
instead rely on return data, for instance, to ensure stationarity of 
the time series data. 
Using the mean alpha and beta values, Figure 13-24 illustrates how the regression is 
updated over time. The 42 different regression lines resulting from the mean alpha 
and beta values are displayed. It is obvious that updating over time improves the 
regression fit (for the current/most recent data) significantly; in other words, “every 
time period needs its own regression”: 


In [41]: plt.figure(figsize=(10, 6))
         plt.scatter(data['GDX'], data['GLD'], c=mpl_dates,
                     marker='o', cmap='coolwarm')
         plt.colorbar(ticks=mpl.dates.DayLocator(interval=250),
                      format=mpl.dates.DateFormatter('%d %b %y'))
         plt.xlabel('GDX')
         plt.ylabel('GLD')
         x = np.linspace(min(data['GDX']), max(data['GDX']))
         for i in range(sh[1]):
             alpha_rw = np.mean(trace_rw['alpha'].T[i])
             beta_rw = np.mean(trace_rw['beta'].T[i])
             plt.plot(x, alpha_rw + beta_rw * x, '--', lw=0.7,
                      color=plt.cm.coolwarm(i / sh[1]))  # (1)


(1) Plots the regression lines for all time intervals of length 50.


Figure 13-24. Scatter plot with time-dependent regression lines (updated estimates) 
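

The statement that every time period needs its own regression can also be illustrated without any Bayesian machinery. The following sketch fits ordinary least squares separately on consecutive blocks of 50 observations and compares the result with a single fit over all data. This is a plain, non-Bayesian stand-in and not the random walk model from above. The synthetic data with an intercept that jumps from 1 to 3 halfway through is purely an assumption for illustration.

```python
import random


def ols(xs, ys):
    """Return (alpha, beta) of the least-squares line y = alpha + beta * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta = s_xy / s_xx
    return y_bar - beta * x_bar, beta


random.seed(100)
block = 50
xs, ys = [], []
for t in range(4 * block):
    regime_alpha = 1.0 if t < 2 * block else 3.0  # intercept shifts over time
    x = random.uniform(0, 2)
    xs.append(x)
    ys.append(regime_alpha + 0.5 * x + random.gauss(0, 0.1))

global_fit = ols(xs, ys)  # one regression for all data
block_fits = [ols(xs[i:i + block], ys[i:i + block])
              for i in range(0, len(xs), block)]  # one regression per block

print(global_fit)
for fit in block_fits:
    print(fit)
```

The single regression ends up with an intercept between the two regimes and fits neither of them well, while the block-wise estimates track the shift. The random walk prior in the Bayesian model has the additional effect of tying neighboring intervals together instead of treating them as unrelated.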


This concludes the section on Bayesian statistics. With PyMC3, Python offers a 
comprehensive package to implement different approaches from Bayesian statistics and 
probabilistic programming. Bayesian regression in particular is a tool that has 
become quite popular and important in quantitative finance. 


Machine Learning 


In finance and many other fields, the “name of the game” these days is machine learn- 
ing (ML). As the following quote puts it: 


Econometrics might be good enough to succeed in financial academia (for now), but 
succeeding in practice requires ML. 


—Marcos Lopez de Prado (2018) 


Machine learning subsumes different types of algorithms that are basically able to 
learn on their own certain relationships, patterns, etc. from raw data. “Further 
Resources” on page 463 lists a number of books that can be consulted on the 
mathematical and statistical aspects of machine learning approaches and algorithms 
as well as on topics related to their implementation and practical use. For example, 
Alpaydin (2016) provides a gentle introduction to the field and gives a nontechnical 
overview of the types of algorithms that are typically used. 


This section takes a rigorously practical approach and focuses on selected implemen- 
tation aspects only—with a view on the techniques used in Chapter 15. However, the 
algorithms and techniques introduced can of course be used in many different finan- 
cial areas and not only in algorithmic trading. The section covers two types of algo- 
rithms: unsupervised and supervised learning algorithms. 


One of the most popular packages for machine learning with Python is scikit-learn. 
It not only provides implementations of a great variety of ML algorithms, but 
also provides a large number of helpful tools for pre- and post-processing activities 
related to ML tasks. This section mainly relies on this package. It also uses TensorFlow 
in the context of deep neural networks (DNNs). 


VanderPlas (2016) provides a concise introduction to different ML algorithms based 
on Python and scikit-learn. Albon (2018) offers a number of recipes for typical 
tasks in ML, also mainly using Python and scikit-learn. 


Unsupervised Learning 


Unsupervised learning embodies the idea that a machine learning algorithm discovers 
insights from raw data without any further guidance. One such algorithm is the k-means 
clustering algorithm that clusters a raw data set into a number of subsets and 
assigns these subsets labels (“cluster 0,” “cluster 1,” etc.). Another one is Gaussian 
mixture.⁹ 


9 For more unsupervised learning algorithms available in scikit-learn, see the documentation. 


The data 


Among other things, scikit-learn allows the creation of sample data sets for differ- 
ent types of ML problems. The following creates a sample data set suited to illustrat- 
ing k-means clustering. 


First, some standard imports and configurations: 


In [1]: import numpy as np
        import pandas as pd
        import datetime as dt
        from pylab import mpl, plt


In [2]: plt.style.use('seaborn')
        mpl.rcParams['font.family'] = 'serif'
        np.random.seed(1000)
        np.set_printoptions(suppress=True, precision=4)
        %matplotlib inline


Second, the creation of the sample data set. Figure 13-25 visualizes the sample data: 


In [3]: from sklearn.datasets import make_blobs


In [4]: X, y = make_blobs(n_samples=250, centers=4,
                          random_state=500, cluster_std=1.25)  # (1)


In [5]: plt.figure(figsize=(10, 6))
        plt.scatter(X[:, 0], X[:, 1], s=50);


(1) Creates the sample data set for clustering with 250 samples and 4 centers.


Figure 13-25. Sample data for the application of clustering algorithms 


k-means clustering 


One of the convenient features of scikit-learn is that it provides a standardized 
API to apply different kinds of algorithms. The following code shows the basic steps 
for k-means clustering that are repeated for other models afterwards: 


• Importing the model class 
• Instantiating a model object 
• Fitting the model object to some data 
• Predicting the outcome given the fitted model for some data 


Figure 13-26 shows the results: 


In [6]: from sklearn.cluster import KMeans  # (1)


In [7]: model = KMeans(n_clusters=4, random_state=0)  # (2)


In [8]: model.fit(X)  # (3)
Out[8]: KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
            n_clusters=4, n_init=10, n_jobs=None, precompute_distances='auto',
            random_state=0, tol=0.0001, verbose=0)


In [9]: y_kmeans = model.predict(X)  # (4)


In [10]: y_kmeans[:12]  # (5)
Out[10]: array([1, 1, 0, 3, 0, 1, 3, 3, 3, 0, 2, 2], dtype=int32)


In [11]: plt.figure(figsize=(10, 6))
         plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='coolwarm');


(1) Imports the model class from scikit-learn.

(2) Instantiates a model object, given certain parameters; knowledge about the 
sample data is used to inform the instantiation.

(3) Fits the model object to the raw data.

(4) Predicts the cluster (number) given the raw data.

(5) Shows some cluster numbers as predicted. 


446 | Chapter 13: Statistics 


Figure 13-26. Sample data and identified clusters 
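

What happens during the fitting step can be sketched with the classic two-step iteration (often called Lloyd's algorithm): assign every point to its nearest center, then move every center to the mean of its points. The following plain-Python version is a simplified illustration and not the scikit-learn implementation, which among other things uses the k-means++ initialization and several restarts. The function name kmeans and the two well-separated toy blobs are assumptions for this example.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Cluster 2D points; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random data points as initial centers
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every point.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:  # converged: nothing moved
            break
        centers = new_centers
    return labels, centers


rng = random.Random(500)
blob_a = [(rng.gauss(-5, 0.5), rng.gauss(-5, 0.5)) for _ in range(50)]
blob_b = [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(50)]

labels, centers = kmeans(blob_a + blob_b, k=2)
print(centers)
```

The labels themselves carry no meaning beyond telling the clusters apart. Which blob ends up as cluster 0 depends on the random initialization, just as with the random_state parameter of KMeans.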


Gaussian mixture 


As an alternative clustering method, consider Gaussian mixture. The application is 
the same, and with the appropriate parameterization, the results are also the same: 


In [12]: from sklearn.mixture import GaussianMixture


In [13]: model = GaussianMixture(n_components=4, random_state=0)


In [14]: model.fit(X)
Out[14]: GaussianMixture(covariance_type='full', init_params='kmeans',
                  max_iter=100,
                  means_init=None, n_components=4, n_init=1, precisions_init=None,
                  random_state=0, reg_covar=1e-06, tol=0.001, verbose=0,
                  verbose_interval=10, warm_start=False, weights_init=None)


In [15]: y_gm = model.predict(X)


In [16]: y_gm[:12]
Out[16]: array([1, 1, 0, 3, 0, 1, 3, 3, 3, 0, 2, 2])


In [17]: (y_gm == y_kmeans).all()  # (1)
Out[17]: True


(1) The results from k-means clustering and Gaussian mixture are the same. 
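

Although the labels coincide here, the two methods work differently. k-means assigns every point to exactly one cluster, whereas a Gaussian mixture model estimates for every point a probability of belonging to each component and is typically fitted by expectation-maximization (EM). The following one-dimensional sketch with two components shows the two alternating steps. The toy data and the starting values are assumptions for illustration, and the code is not the scikit-learn implementation.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


random.seed(0)
data = ([random.gauss(-2, 0.7) for _ in range(150)]
        + [random.gauss(3, 1.0) for _ in range(150)])

# Hypothetical starting values: equal weights, means at the data extremes.
weights = [0.5, 0.5]
means = [min(data), max(data)]
sigmas = [1.0, 1.0]

for _ in range(50):
    # E-step: probability of each component for every data point.
    resp = []
    for x in data:
        dens = [w * normal_pdf(x, m, s)
                for w, m, s in zip(weights, means, sigmas)]
        total = sum(dens)
        resp.append([d / total for d in dens])
    # M-step: re-estimate the parameters with these probabilities as weights.
    for j in range(2):
        r_j = [r[j] for r in resp]
        n_j = sum(r_j)
        weights[j] = n_j / len(data)
        means[j] = sum(r * x for r, x in zip(r_j, data)) / n_j
        var = sum(r * (x - means[j]) ** 2 for r, x in zip(r_j, data)) / n_j
        sigmas[j] = math.sqrt(var)

# Hard labels pick the more probable component.
hard_labels = [0 if r[0] >= r[1] else 1 for r in resp]
print(weights, means, sigmas)
```

The list resp holds the soft assignments. Only the final step of picking the more probable component turns them into cluster numbers comparable to those from k-means.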
Supervised Learning 


Supervised learning is machine learning with some guidance in the form of known 
results or observed data. This means that the raw data already contains what the ML 
algorithm is supposed to learn. In what follows, the focus lies on classification prob- 
lems as opposed to estimation problems. While estimation problems are about the 
estimation of real-valued quantities in general, classification problems are character- 
ized by an effort to assign to a certain feature combination a certain class (integer 
value) from a relatively small set of classes (integer values). 


The examples in the previous subsection showed that with unsupervised learning the 
algorithms come up with their own categorical labels for the clusters identified. With 
four clusters, the labels are 0, 1, 2, and 3. In supervised learning, such categorical 
labels are already given, so that the algorithm can learn the relationship between the 
features and the categories (classes). In other words, during the fitting step, the algo- 
rithm knows the right class for the given feature value combinations. 


This subsection illustrates the application of the following classification algorithms: 
Gaussian Naive Bayes, logistic regression, decision trees, deep neural networks, and 
support vector machines.¹⁰ 


The data 


Again, scikit-learn allows the creation of an appropriate sample data set to apply 
classification algorithms. In order to be able to visualize the results, the sample data 
only contains two real-valued, informative features and a single binary label (a binary 
label is characterized by two different classes only, 0 and 1). The following code cre- 
ates the sample data, shows some extracts of the data, and visualizes the data (see 
Figure 13-27): 


In [18]: from sklearn.datasets import make_classification


In [19]: n_samples = 100


In [20]: X, y = make_classification(n_samples=n_samples, n_features=2,
                                    n_informative=2, n_redundant=0,
                                    n_repeated=0, random_state=250)


In [21]: X[:5]  # (1)
Out[21]: array([[ 1.6876, -0.7976],
                [-0.4312, -0.7606],
                [-1.4393, -1.2363],
                [ 1.118 , -1.8682],
                [ 0.0502,  0.659 ]])


In [22]: X.shape  # (1)
Out[22]: (100, 2)


In [23]: y[:5]  # (2)
Out[23]: array([1, 0, 0, 1, 1])


In [24]: y.shape  # (2)
Out[24]: (100,)


         plt.figure(figsize=(10, 6))
         plt.hist(X);


In [25]: plt.figure(figsize=(10, 6))
         plt.scatter(x=X[:, 0], y=X[:, 1], c=y, cmap='coolwarm');


(1) The two informative, real-valued features.

(2) The single binary label.


Figure 13-27. Sample data for the application of classification algorithms 


10 For an overview of the classification algorithms for supervised learning available in scikit-learn, refer to the 
documentation. Note that many of these algorithms are also available for estimation instead of classification. 


Gaussian Naive Bayes 


Gaussian Naive Bayes (GNB) is generally considered to be a good baseline algorithm 
for a multitude of different classification problems. The application is in line with the 
steps outlined in “k-means clustering” on page 446: 


In [26]: from sklearn.naive_bayes import GaussianNB
         from sklearn.metrics import accuracy_score


In [27]: model = GaussianNB()


In [28]: model.fit(X, y)
Out[28]: GaussianNB(priors=None, var_smoothing=1e-09)


In [29]: model.predict_proba(X).round(4)[:5]  # (1)
Out[29]: array([[0.0041, 0.9959],
                [0.8534, 0.1466],
                [0.9947, 0.0053],
                [0.0182, 0.9818],
                [0.5156, 0.4844]])


In [30]: pred = model.predict(X)  # (2)


In [31]: pred  # (2)
Out[31]: array([1, 0, 0, 1, 0, ...])


In [32]: pred == y  # (3)
Out[32]: array([ True, ...,  True])


In [33]: accuracy_score(y, pred)  # (4)
Out[33]: 0.87


(1) Shows the probabilities that the algorithm assigns to each class after fitting.

(2) Based on the probabilities, predicts the binary classes for the data set.

(3) Compares the predicted classes with the real ones.

(4) Calculates the accuracy score given the predicted values. 


Figure 13-28 visualizes the correct and false predictions from GNB: 


In [34]: Xc = X[y == pred]  # (1)
         Xf = X[y != pred]


In [35]: plt.figure(figsize=(10, 6))
         plt.scatter(x=Xc[:, 0], y=Xc[:, 1], c=y[y == pred],
                     marker='o', cmap='coolwarm')
         plt.scatter(x=Xf[:, 0], y=Xf[:, 1], c=y[y != pred],
                     marker='x', cmap='coolwarm')  # (2)


(1) Selects the correct predictions and plots them.

(2) Selects the false predictions and plots them.


Figure 13-28. Correct (dots) and false predictions (crosses) from GNB 
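

The probabilities returned by predict_proba() follow from Bayes' rule under the assumption that, within each class, the features are independent and normally distributed. The following plain-Python sketch reproduces this logic in a simplified form. The helper names fit_gnb, predict_proba, and predict as well as the toy data are assumptions for illustration, and the variance smoothing is cruder than in scikit-learn.

```python
import math
import random
from statistics import mean, pvariance


def fit_gnb(X, y):
    """Estimate class priors and per-feature means/variances."""
    model = {}
    for c in sorted(set(y)):
        rows = [row for row, label in zip(X, y) if label == c]
        columns = list(zip(*rows))
        model[c] = {
            'prior': len(rows) / len(X),
            'mean': [mean(col) for col in columns],
            # small constant guards against zero variance (a simplification)
            'var': [pvariance(col) + 1e-9 for col in columns],
        }
    return model


def predict_proba(model, row):
    """Posterior class probabilities for one feature combination."""
    log_joint = {}
    for c, params in model.items():
        total = math.log(params['prior'])
        for value, m, v in zip(row, params['mean'], params['var']):
            total += (-0.5 * math.log(2 * math.pi * v)
                      - 0.5 * (value - m) ** 2 / v)
        log_joint[c] = total
    top = max(log_joint.values())  # subtract the maximum for stability
    unnormalized = {c: math.exp(v - top) for c, v in log_joint.items()}
    norm = sum(unnormalized.values())
    return {c: u / norm for c, u in unnormalized.items()}


def predict(model, row):
    proba = predict_proba(model, row)
    return max(proba, key=proba.get)  # class with the highest probability


random.seed(250)
X_demo = ([(random.gauss(-1.5, 1), random.gauss(-1.5, 1)) for _ in range(50)]
          + [(random.gauss(1.5, 1), random.gauss(1.5, 1)) for _ in range(50)])
y_demo = [0] * 50 + [1] * 50

gnb = fit_gnb(X_demo, y_demo)
pred_demo = [predict(gnb, row) for row in X_demo]
accuracy = sum(p == t for p, t in zip(pred_demo, y_demo)) / len(y_demo)
print(predict_proba(gnb, X_demo[0]), accuracy)
```

As in the example above, the accuracy is measured on the same data that was used for the fitting, so it says little about the performance on new data.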


Logistic regression 


Logistic regression (LR) is a fast and scalable classification algorithm. The accuracy in 
this particular case is slightly better than with GNB: 


In [36]: from sklearn.linear_model import LogisticRegression


In [37]: model = LogisticRegression(C=1, solver='lbfgs')


In [38]: model.fit(X, y)
Out[38]: LogisticRegression(C=1, class_weight=None, dual=False,
          fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)


In [39]: model.predict_proba(X).round(4)[:5]
Out[39]: array([[0.011 , 0.989 ],
                [0.7266, 0.2734],
                [0.971 , 0.029 ],
                [0.04  , 0.96  ],
                [0.4843, 0.5157]])


In [40]: pred = model.predict(X)


In [41]: accuracy_score(y, pred)
Out[41]: 0.9


In [42]: Xc = X[y == pred]
         Xf = X[y != pred]


In [43]: plt.figure(figsize=(10, 6))
         plt.scatter(x=Xc[:, 0], y=Xc[:, 1], c=y[y == pred],
                     marker='o', cmap='coolwarm')
         plt.scatter(x=Xf[:, 0], y=Xf[:, 1], c=y[y != pred],
                     marker='x', cmap='coolwarm');


Decision trees 


Decision trees (DTs) are yet another type of classification algorithm that scales quite 
well. With a maximum depth of 1, the algorithm already performs slightly better than 
both GNB and LR (see also Figure 13-29): 


In [44]: from sklearn.tree import DecisionTreeClassifier


In [45]: model = DecisionTreeClassifier(max_depth=1)


In [46]: model.fit(X, y)
Out[46]: DecisionTreeClassifier(class_weight=None, criterion='gini',
          max_depth=1,
          max_features=None, max_leaf_nodes=None,
          min_impurity_decrease=0.0, min_impurity_split=None,
          min_samples_leaf=1, min_samples_split=2,
          min_weight_fraction_leaf=0.0, presort=False, random_state=None,
          splitter='best')


In [47]: model.predict_proba(X).round(4)[:5]
Out[47]: array([[0.08, 0.92],
                [0.92, 0.08],
                [0.92, 0.08],
                [0.08, 0.92],
                [0.08, 0.92]])


In [48]: pred = model.predict(X)


In [49]: accuracy_score(y, pred)
Out[49]: 0.92


In [50]: Xc = X[y == pred]
         Xf = X[y != pred]


In [51]: plt.figure(figsize=(10, 6))
         plt.scatter(x=Xc[:, 0], y=Xc[:, 1], c=y[y == pred],
                     marker='o', cmap='coolwarm')
         plt.scatter(x=Xf[:, 0], y=Xf[:, 1], c=y[y != pred],
                     marker='x', cmap='coolwarm');


Figure 13-29. Correct (dots) and false predictions (crosses) from DT (max_depth=1) 
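

A decision tree with a maximum depth of 1 consists of a single split: one feature, one threshold, and two leaves. The following plain-Python sketch searches for such a split by minimizing the weighted Gini impurity, the criterion that also appears in the output above. It is a simplified illustration and not the scikit-learn implementation. The function names and the six toy observations are assumptions for this example.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_stump(X, y):
    """Return (impurity, feature index, threshold) of the best single split."""
    best = None
    n = len(y)
    for feature in range(len(X[0])):
        values = sorted(set(row[feature] for row in X))
        # Candidate thresholds lie halfway between neighboring feature values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [lab for row, lab in zip(X, y) if row[feature] <= threshold]
            right = [lab for row, lab in zip(X, y) if row[feature] > threshold]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or impurity < best[0]:
                best = (impurity, feature, threshold)
    return best


X_toy = [(0.1, 5.0), (0.4, 1.0), (0.35, 4.0),
         (0.8, 2.0), (0.9, 4.5), (0.7, 0.5)]
y_toy = [0, 0, 0, 1, 1, 1]

impurity, feature, threshold = best_stump(X_toy, y_toy)
print(impurity, feature, threshold)
```

With only two leaves, such a tree can return only two different probability vectors, namely the class fractions in each leaf. This is why Out[47] shows nothing but the values 0.08 and 0.92.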


However, increasing the maximum depth parameter for the decision tree allows one 
to reach a perfect result: 


In [52]: print('{:>8s} | {:8s}'.format('depth', 'accuracy'))
         print(20 * '-')
         for depth in range(1, 7):
             model = DecisionTreeClassifier(max_depth=depth)
             model.fit(X, y)
             acc = accuracy_score(y, model.predict(X))
             print('{:8d} | {:8.2f}'.format(depth, acc))
            depth | accuracy 


Deep neural networks 


Deep neural networks (DNNs) are considered to be among the most powerful—but 
also computationally demanding—algorithms for both estimation and classification. 
The open sourcing of the TensorFlow package by Google and related success stories 
are in part responsible for their popularity. DNNs are capable of learning and model- 
ing complex nonlinear relationships. Although their origins date back to the 1970s, 
they only recently have become feasible on a large scale due to advances in hardware 
(CPUs, GPUs, TPUs), numerical algorithms, and related software implementations. 


While other ML algorithms, such as linear models of LR type, can be fitted efficiently 
based on a standard optimization problem, DNNs rely on deep learning, which 
requires in general a large number of repeated steps to adjust certain parameters 
(weights) and compare the results to the data. In that sense, deep learning can be 
compared to Monte Carlo simulation in mathematical finance where the price of, say, 
a European call option can be estimated on the basis of 100,000 simulated paths for 
the underlying. On the other hand, the Black-Scholes-Merton option pricing formula 
is available in closed form and can be evaluated analytically. 


While Monte Carlo simulation is among the most flexible and powerful numerical 
techniques in mathematical finance, there’s a cost to pay in terms of the high compu- 
tational burden and large memory footprint. The same holds true for deep learning, 
which is more flexible in general than many other ML algorithms but which requires 
greater computational power. 
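

The comparison can be made concrete with a short worked example that values the same European call option both ways. It is only a sketch in plain Python: the parameter values (initial level 100, strike 105, one year to maturity, 5% short rate, 20% volatility), the seed, and the function names are assumptions for illustration.

```python
import math
import random


def bsm_call(S0, K, T, r, sigma):
    """Black-Scholes-Merton value of a European call option (closed form)."""
    d1 = ((math.log(S0 / K) + (r + 0.5 * sigma ** 2) * T)
          / (sigma * math.sqrt(T)))
    d2 = d1 - sigma * math.sqrt(T)

    def cdf(d):
        # standard normal cumulative distribution function
        return 0.5 * (1 + math.erf(d / math.sqrt(2)))

    return S0 * cdf(d1) - K * math.exp(-r * T) * cdf(d2)


def mc_call(S0, K, T, r, sigma, n_paths=100_000, seed=100):
    """Monte Carlo estimator based on simulated index levels at maturity."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma ** 2) * T
    vol = sigma * math.sqrt(T)
    payoff_sum = 0.0
    for _ in range(n_paths):
        ST = S0 * math.exp(drift + vol * rng.gauss(0, 1))
        payoff_sum += max(ST - K, 0.0)
    # discounted average payoff
    return math.exp(-r * T) * payoff_sum / n_paths


analytical = bsm_call(100, 105, 1.0, 0.05, 0.2)
simulated = mc_call(100, 105, 1.0, 0.05, 0.2)
print(analytical, simulated)
```

The closed-form value requires a handful of function evaluations, while the simulation needs 100,000 random draws to come reasonably close to it. The simulation approach, however, carries over to payoffs for which no formula exists, which is the kind of flexibility that deep learning offers relative to simpler ML algorithms.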


DNNs with scikit-learn. Although it is quite different in nature, scikit-learn provides 
the same API for its MLPClassifier algorithm class,¹¹ which is a DNN model, as for 
the other ML algorithms used before. With just two so-called hidden layers it reaches 
a perfect result on the sample data (the hidden layers are what make deep learning out of 
simple learning, e.g., “learning” weights in the context of a linear regression instead 
of using OLS regression to derive them directly): 


In [53]: from sklearn.neural_network import MLPClassifier


In [54]: model = MLPClassifier(solver='lbfgs', alpha=1e-5,
                               hidden_layer_sizes=2 * [75], random_state=10)


In [55]: %time model.fit(X, y)
         CPU times: user 537 ms, sys: 14.2 ms, total: 551 ms
         Wall time: 340 ms


Out[55]: MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto',
          beta_1=0.9,
          beta_2=0.999, early_stopping=False, epsilon=1e-08,
          hidden_layer_sizes=[75, 75], learning_rate='constant',
          learning_rate_init=0.001, max_iter=200, momentum=0.9,
          n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
          random_state=10, shuffle=True, solver='lbfgs', tol=0.0001,
          validation_fraction=0.1, verbose=False, warm_start=False)


In [56]: pred = model.predict(X)
         pred
Out[56]: array([1, 0, 0, 1, 1, ...])


In [57]: accuracy_score(y, pred)
Out[57]: 1.0


11 For more details and available parameters, refer to the documentation on the multi-layer perceptron 
classifier. 
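

The difference between deriving weights directly and “learning” them step by step can be illustrated with a simple linear regression. The sketch below computes the OLS solution in closed form and then approximates the same parameters by gradient descent on the mean squared error. The synthetic data, the learning rate, and the number of steps are assumptions chosen for illustration, and nothing here involves hidden layers.

```python
import random

random.seed(10)
xs = [i / 10 for i in range(50)]
ys = [1.0 + 2.0 * x + random.gauss(0, 0.3) for x in xs]
n = len(xs)

# Direct derivation: closed-form OLS solution.
x_bar = sum(xs) / n
y_bar = sum(ys) / n
beta_ols = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))
alpha_ols = y_bar - beta_ols * x_bar

# "Learning": repeated small adjustments that reduce the mean squared error.
alpha, beta = 0.0, 0.0
learning_rate = 0.05  # hypothetical value, small enough to converge here
for _ in range(5000):
    errors = [alpha + beta * x - y for x, y in zip(xs, ys)]
    grad_alpha = 2 * sum(errors) / n
    grad_beta = 2 * sum(e * x for e, x in zip(errors, xs)) / n
    alpha -= learning_rate * grad_alpha
    beta -= learning_rate * grad_beta

print(alpha_ols, beta_ols)
print(alpha, beta)
```

Both approaches arrive at practically the same parameters, but the iterative one needs thousands of passes over the data. For models with hidden layers no closed-form solution is available, so some form of iterative learning is the only option.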


DNNs with TensorFlow. The API of TensorFlow is different from the scikit-learn 
standard. However, the application of the DNNClassifier class is similarly 
straightforward: 


In [58]: import tensorflow as tf 
         tf.logging.set_verbosity(tf.logging.ERROR)  (1) 


In [59]: fc = [tf.contrib.layers.real_valued_column('features')]  (2) 


In [60]: model = tf.contrib.learn.DNNClassifier(hidden_units=5 * [250], 
                                                n_classes=2, 
                                                feature_columns=fc)  (3) 


In [61]: def input_fn():  (4) 
             fc = {'features': tf.constant(X)} 
             la = tf.constant(y) 
             return fc, la 


In [62]: %time model.fit(input_fn=input_fn, steps=100)  (5) 
         CPU times: user 7.1 s, sys: 1.35 s, total: 8.45 s 
         Wall time: 4.71 s 
Out[62]: DNNClassifier(params={'head': 
          <tensorflow.contrib.learn.python.learn ... head._BinaryLogisticHead 
          object at 0x1a3ee692b0>, 'hidden_units': [250, 250, 250, 250, 250], 
          'feature_columns': (_RealValuedColumn(column_name='features', 
          dimension=1, default_value=None, dtype=tf.float32, normalizer=None),), 
          'optimizer': None, 'activation_fn': <function relu at 0x1a3aa75b70>, 
          'dropout': None, 'gradient_clip_norm': None, 
          'embedding_lr_multipliers': None, 'input_layer_min_slice_size': None}) 


In [63]: model.evaluate(input_fn=input_fn, steps=1)  (5) 
Out[63]: {'loss': 0.18724777, 
          'accuracy': 0.91, 
          'labels/prediction_mean': 0.5003989, 
          'labels/actual_label_mean': 0.5, 
          'accuracy/baseline_label_mean': 0.5, 
          'auc': 0.9782, 
          'auc_precision_recall': 0.97817385, 
          'accuracy/threshold_0.500000_mean': 0.91, 
          'precision/positive_threshold_0.500000_mean': 0.9019608, 
          'recall/positive_threshold_0.500000_mean': 0.92, 
          'global_step': 100} 


In [64]: pred = np.array(list(model.predict(input_fn=input_fn)))  (6) 
         pred[:10] 
Out[64]: array([1, 0, 0, 1, 1, 0, 1, 1, 1, 1]) 


In [65]: %time model.fit(input_fn=input_fn, steps=750)  (7) 
         CPU times: user 29.8 s, sys: 7.51 s, total: 37.3 s 
         Wall time: 13.6 s 
Out[65]: DNNClassifier(params={'head': 
          <tensorflow.contrib.learn.python.learn ... head._BinaryLogisticHead 
          object at 0x1a3ee692b0>, 'hidden_units': [250, 250, 250, 250, 250], 
          'feature_columns': (_RealValuedColumn(column_name='features', 
          dimension=1, default_value=None, dtype=tf.float32, normalizer=None),), 
          'optimizer': None, 'activation_fn': <function relu at 0x1a3aa75b70>, 
          'dropout': None, 'gradient_clip_norm': None, 
          'embedding_lr_multipliers': None, 'input_layer_min_slice_size': None}) 


In [66]: model.evaluate(input_fn=input_fn, steps=1)  (8) 
Out[66]: {'loss': 0.09271307, 
          'accuracy': 0.94, 
          'labels/prediction_mean': 0.5274486, 
          'labels/actual_label_mean': 0.5, 
          'accuracy/baseline_label_mean': 0.5, 
          'auc': 0.99759996, 
          'auc_precision_recall': 0.9977609, 
          'accuracy/threshold_0.500000_mean': 0.94, 
          'precision/positive_threshold_0.500000_mean': 0.9074074, 
          'recall/positive_threshold_0.500000_mean': 0.98, 
          'global_step': 850} 


(1) Sets the verbosity for TensorFlow logging. 

(2) Defines the real-valued features abstractly. 

(3) Instantiates the model object. 

(4) Features and label data are to be delivered by a function. 

(5) Fits the model through learning and evaluates it. 

(6) Predicts the label values based on the feature values. 

(7) Retrains the model based on more learning steps; the previous results are taken 
as a starting point. 

(8) Accuracy increases after retraining. 


This only scratches the surface of TensorFlow, which is used in a number of demand- 
ing use cases, such as Alphabet Inc.’s effort to build self-driving cars. In terms of 
speed, the training of TensorFlow’s models in general benefits significantly from the 
use of specialized hardware such as GPUs and TPUs instead of CPUs. 


Feature transforms 


For a number of reasons, it might be beneficial or even necessary to transform real- 
valued features. The following code shows some typical transformations and visual- 
izes the results for comparison in Figure 13-30: 


In [67]: from sklearn import preprocessing 


In [68]: X[:5] 

Out[68]: array([[ 1.6876, -0.7976], 
[-0.4312, -0.7606], 
[-1.4393, -1.2363], 
[ 1.118 , -1.8682], 
[ 0.0502, 0.659 ]]) 


In [69]: Xs = preprocessing.StandardScaler().fit_transform(X) (1) 
Xs[:5] 
Out[69]: array([[ 1.2881, -0.5489], 
[-0.3384, -0.5216], 
[-1.1122, -0.873 ], 
[ 0.8509, -1.3399], 
[ 0.0312, 0.5273]]) 


In [70]: Xm = preprocessing.MinMaxScaler().fit_transform(X) (2) 
Xm[:5] 
Out[70]: array([[0.7262, 0.3563], 
[0.3939, 0.3613], 
[0.2358, 0.2973], 
[0.6369, 0.2122], 
[0.4694, 0.5523]]) 


In [71]: Xn1 = preprocessing.Normalizer(norm='l1').transform(X)  (3) 
         Xn1[:5] 
Out[71]: array([[ 0.6791, -0.3209], 
                [-0.3618, -0.6382], 
                [-0.5379, -0.4621], 
                [ 0.3744, -0.6256], 
                [ 0.0708,  0.9292]]) 


In [72]: Xn2 = preprocessing.Normalizer(norm='l2').transform(X)  (3) 
         Xn2[:5] 
Out[72]: array([[ 0.9041, -0.4273], 
                [-0.4932, -0.8699], 
                [-0.7586, -0.6516], 
                [ 0.5135, -0.8581], 
                [ 0.076 ,  0.9971]]) 


In [73]: plt.figure(figsize=(10, 6)) 
         markers = ['o', '.', 'x', '^', 'v'] 
         data_sets = [X, Xs, Xm, Xn1, Xn2] 
         labels = ['raw', 'standard', 'minmax', 'norm(1)', 'norm(2)'] 
         for x, m, l in zip(data_sets, markers, labels): 
             plt.scatter(x=x[:, 0], y=x[:, 1], c=y, 
                         marker=m, cmap='coolwarm', label=l) 
         plt.legend(); 


(1) Transforms the features data to standard normally distributed data with zero 
mean and unit variance. 

(2) Transforms the features data to a given range for every feature as defined by the 
minimum and maximum values per feature. 

(3) Scales the features data individually to the unit norm (L1 or L2). 


Figure 13-30. Raw and transformed data in comparison 
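To make the three transformations more tangible, the following is a minimal pure-Python sketch of what they compute. The function names are hypothetical and the code is not scikit-learn's implementation. It uses only the five rows of X printed above, so the column-wise results (standard and min-max scaling) differ from the outputs in the text, which are based on the complete data set; the row-wise normalization, however, reproduces the first rows of Xn1 and Xn2.

```python
import math

# First rows of the features data as printed in the text.
X = [[1.6876, -0.7976], [-0.4312, -0.7606], [-1.4393, -1.2363],
     [1.118, -1.8682], [0.0502, 0.659]]


def standard_scale(rows):
    """Column-wise: subtract the mean, divide by the population std."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)]
            for row in rows]


def minmax_scale(rows):
    """Column-wise: map the minimum to 0 and the maximum to 1."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [[(v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs)]
            for row in rows]


def normalize(rows, norm='l2'):
    """Row-wise: divide every sample by its own L1 or L2 norm."""
    result = []
    for row in rows:
        if norm == 'l1':
            length = sum(abs(v) for v in row)
        else:
            length = math.sqrt(sum(v ** 2 for v in row))
        result.append([v / length for v in row])
    return result


print([round(v, 4) for v in normalize(X, 'l1')[0]])
print([round(v, 4) for v in normalize(X, 'l2')[0]])
```

The key design difference is the axis: the two scalers work per feature (column), while the normalization works per sample (row) and therefore does not depend on the rest of the data set.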
In terms of pattern recognition tasks, a transformation to categorical features is often 
helpful or even required to achieve acceptable results. To this end, the real values of 
the features are mapped to a limited, fixed number of possible integer values 
(categories, classes): 


In [74]: X[:5] 

Out[74]: array([[ 1.6876, -0.7976], 
[-0.4312, -0.7606], 
[-1.4393, -1.2363], 
[ 1.118 , -1.8682], 
[ 0.0502, 0.659 ]]) 


In [75]: Xb = preprocessing.Binarizer().fit_transform(X)  (1) 
         Xb[:5] 
Out[75]: array([[1., 0.], 
                [0., 0.], 
                [0., 0.], 
                [1., 0.], 
                [1., 1.]]) 


In [76]: 2 ** 2  (2) 
Out[76]: 4 


In [77]: Xd = np.digitize(X, bins=[-1, 0, 1])  (3) 
         Xd[:5] 
Out[77]: array([[3, 1], 
                [1, 1], 
                [0, 0], 
                [3, 0], 
                [2, 2]]) 


In [78]: 4 ** 2  (4) 
Out[78]: 16 


(1) Transforms the features to binary features. 

(2) The number of possible feature value combinations for two binary features. 

(3) Transforms the features to categorical features based on a list of values used for 
binning. 

(4) The number of possible feature value combinations, with three values used for 
binning for two features. 
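The binning step can be sketched without NumPy. The helper below is a hypothetical stand-in that uses the standard library's bisect module; for ascending bins it gives the same category indices as the np.digitize() call above for the five rows shown, and it makes the count of 16 combinations explicit.

```python
from bisect import bisect_right

# First rows of the features data as printed in the text.
X = [[1.6876, -0.7976], [-0.4312, -0.7606], [-1.4393, -1.2363],
     [1.118, -1.8682], [0.0502, 0.659]]
bins = [-1, 0, 1]


def digitize(value, bins):
    """Returns the index of the bin into which value falls (bins ascending)."""
    return bisect_right(bins, value)


Xd = [[digitize(v, bins) for v in row] for row in X]
print(Xd)

# Three bin edges create four categories per feature.
n_categories = len(bins) + 1
print(n_categories ** 2)
```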


Train-test splits: Support vector machines 


At this point, every seasoned ML researcher and practitioner reading this probably 
has concerns with regard to the implementations in this section: they all rely on the 
same data for training, learning, and prediction. The quality of an ML algorithm can 
of course be better judged when different data (sub)sets are used for training and 
learning on the one hand and testing on the other hand. This comes closer to a real- 
world application scenario. 


Again, scikit-learn provides a function to accomplish such an approach efficiently. 
In particular, the train_test_split() function allows the splitting of data sets into 
training and test data in a randomized, but nevertheless repeatable, fashion. 
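What “randomized, but nevertheless repeatable” means can be sketched in a few lines of plain Python. The function below is a hypothetical simplification, not the scikit-learn implementation: it shuffles the sample indices with a seeded random number generator, so that the same seed always yields the same split. The data is made up for illustration.

```python
import random


def split_train_test(features, labels, test_size=0.33, seed=0):
    """Shuffles the sample indices with a seeded generator and splits them."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # same seed, same shuffle
    n_test = round(test_size * len(indices))  # simplified rounding (assumption)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_x = [features[i] for i in train_idx]
    test_x = [features[i] for i in test_idx]
    train_y = [labels[i] for i in train_idx]
    test_y = [labels[i] for i in test_idx]
    return train_x, test_x, train_y, test_y


features = [[float(i), float(-i)] for i in range(100)]  # made-up data
labels = [i % 2 for i in range(100)]

train_x, test_x, train_y, test_y = split_train_test(features, labels)
print(len(train_x), len(test_x))
```

Changing the seed changes which samples end up in the test set, but not the sizes of the two subsets.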


The following code uses yet another classification algorithm, the support vector 
machine (SVM). It first fits the SVM model based on the training data: 


In [79]: from sklearn.svm import SVC 
from sklearn.model_selection import train_test_split 


In [80]: train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.33, 
random_state=0) 


In [81]: model = SVC(C=1, kernel='linear') 


In [82]: model.fit(train_x, train_y)  (1) 
Out[82]: SVC(C=1, cache_size=200, class_weight=None, coef0=0.0, 
           decision_function_shape='ovr', degree=3, gamma='auto_deprecated', 
           kernel='linear', max_iter=-1, probability=False, random_state=None, 
           shrinking=True, tol=0.001, verbose=False) 


In [83]: pred_train = model.predict(train_x)  (2) 


In [84]: accuracy_score(train_y, pred_train)  (3) 
Out[84]: 0.9402985074626866 


(1) Fits the model based on the training data. 

(2) Predicts the training data label values. 

(3) The accuracy of the training data prediction (“in-sample”). 


Next, the fitted model is tested on the test data. Figure 13-31 shows the 
correct and false predictions for the test data. The accuracy on the test data is, as 
one would naturally expect, lower than on the training data: 


In [85]: pred_test = model.predict(test_x) (1) 


In [86]: test_y == pred_test (2) 

Out[86]: array([ True, True, True, True, True, True, True, True, True, 
True, False, False, False, True, True, True, False, False, 
False, True, True, True, True, True, True, True, True, 
True, True, True, True, False, True]) 


In [87]: accuracy_score(test_y, pred_test) (2) 
Out[87]: 0.7878787878787878 


In [88]: test_c = test_x[test_y == pred_test] 
         test_f = test_x[test_y != pred_test] 


In [89]: plt.figure(figsize=(10, 6)) 
         plt.scatter(x=test_c[:, 0], y=test_c[:, 1], c=test_y[test_y == pred_test], 
                     marker='o', cmap='coolwarm') 
         plt.scatter(x=test_f[:, 0], y=test_f[:, 1], c=test_y[test_y != pred_test], 
                     marker='x', cmap='coolwarm'); 


(1) Predicts the testing data label values based on the test data. 

(2) Evaluates the accuracy of the fitted model for the test data (“out-of-sample”). 


Figure 13-31. Correct (dots) and false predictions (crosses) from SVM for test data 
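The out-of-sample accuracy is simply the share of correct predictions. As a plain-Python check, the sketch below counts the True values of the element-wise comparison printed in Out[86]; the helper name is hypothetical and stands in for accuracy_score().

```python
# The element-wise comparison printed in Out[86] (True = correct prediction).
matches = [
    True, True, True, True, True, True, True, True, True,
    True, False, False, False, True, True, True, False, False,
    False, True, True, True, True, True, True, True, True,
    True, True, True, True, False, True,
]


def accuracy(matches):
    """Share of correct predictions among all predictions."""
    return sum(matches) / len(matches)


print(len(matches), sum(matches), accuracy(matches))
```

With 26 correct predictions out of 33 test samples, this is in line with the accuracy score reported in Out[87].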


The SVM classification algorithm provides a number of options for the kernel to be 
used. Depending on the problem at hand, different kernels might lead to quite differ- 
ent results (i.e., accuracy scores), as the following analysis shows. The code first trans- 
forms the real-valued features into categorical ones: 


In [90]: bins = np.linspace(-4.5, 4.5, 50) 


In [91]: Xd = np.digitize(X, bins=bins) 


In [92]: Xd[:5] 
Out[92]: array([[34, 21], 
                [23, 21], 
                [17, 18], 
                [31, 15], 
                [25, 29]]) 


In [93]: train_x, test_x, train_y, test_y = train_test_split(Xd, y, test_size=0.33, 
                                                             random_state=0) 


In [94]: print('{:>8s} | {:8s}'.format('kernel', 'accuracy')) 
         print(20 * '-') 
         for kernel in ['linear', 'poly', 'rbf', 'sigmoid']: 
             model = SVC(C=1, kernel=kernel, gamma='auto') 
             model.fit(train_x, train_y) 
             acc = accuracy_score(test_y, model.predict(test_x)) 
             print('{:>8s} | {:8.3f}'.format(kernel, acc)) 
           kernel | accuracy 
         -------------------- 
           linear |    0.848 
             poly |    0.758 
              rbf |    0.788 
          sigmoid |    0.455 


Conclusion 


Statistics is not only an important discipline in its own right, but also provides indis- 
pensable tools for many other disciplines, like finance and the social sciences. It is 
impossible to give a broad overview of such a large subject in a single chapter. This 
chapter therefore focuses on four important topics, illustrating the use of Python and 
several statistics libraries on the basis of realistic examples: 


Normality 
The normality assumption with regard to financial market returns is an impor- 
tant one for many financial theories and applications; it is therefore important to 
be able to test whether certain time series data conforms to this assumption. As 
seen in “Normality Tests” on page 398—via graphical and statistical means— 
real-world return data generally is not normally distributed. 


Portfolio optimization 
MPT, with its focus on the mean and variance/volatility of returns, can be con- 
sidered not only one of the first but also one of the major conceptual successes of 
statistics in finance; the important concept of investment diversification is beauti- 
fully illustrated in this context. 


Bayesian statistics 
Bayesian statistics in general (and Bayesian regression in particular) has become 
a popular tool in finance, since this approach overcomes some shortcomings of 
other approaches, as introduced, for instance, in Chapter 11; even if the mathe- 
matics and the formalism are more involved, the fundamental ideas—like the 
updating of probability/distribution beliefs over time—are easily grasped (at least 
intuitively). 


Machine learning 
Nowadays, machine learning has established itself in the financial domain 
alongside traditional statistical methods and techniques. The chapter introduces ML 
algorithms for unsupervised learning (such as k-means clustering) and 
supervised learning (such as DNN classifiers) and illustrates selected related topics, 
such as feature transforms and train-test splits. 


Further Resources 


For more information on the topics and packages covered in this chapter, consult the 
following online resources: 


The documentation on SciPy’s statistical functions 

The documentation of the statsmodels library 

Details on the optimization functions used in this chapter 

The documentation for PyMC3 

The documentation for scikit-learn 


Useful references in book form for more background information are: 


Albon, Chris (2018). Machine Learning with Python Cookbook. Sebastopol, CA: 
O'Reilly. 
Alpaydin, Ethem (2016). Machine Learning. Cambridge, MA: MIT Press. 


Copeland, Thomas, Fred Weston, and Kuldeep Shastri (2005). Financial Theory 
and Corporate Policy. Boston, MA: Pearson. 

Downey, Allen (2013). Think Bayes. Sebastopol, CA: O'Reilly. 

Geweke, John (2005). Contemporary Bayesian Econometrics and Statistics. Hobo- 
ken, NJ: John Wiley & Sons. 

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman (2009). The Elements of 
Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer. 

James, Gareth, et al. (2013). An Introduction to Statistical Learning: With 
Applications in R. New York: Springer. 

Lopez de Prado, Marcos (2018). Advances in Financial Machine Learning. Hobo- 
ken, NJ: John Wiley & Sons. 

Rachev, Svetlozar, et al. (2008). Bayesian Methods in Finance. Hoboken, NJ: John 
Wiley & Sons. 


VanderPlas, Jake (2016). Python Data Science Handbook. Sebastopol, CA: 
O'Reilly. 


The paper introducing modern portfolio theory is: 


Markowitz, Harry (1952). “Portfolio Selection.” Journal of Finance, Vol. 7, pp. 77-91. 


PART IV 
Algorithmic Trading 


This part of the book is about the use of Python for algorithmic trading. More and 
more trading platforms and brokers allow their clients to use, for example, REST 
APIs to programmatically retrieve historical data or streaming data, or to place buy 
and sell orders. What has been the domain of large financial institutions for a long 
period now has become accessible even to retail algorithmic traders. In this space, 
Python has secured a top position as a programming language and technology plat- 
form. Among other factors, this is driven by the fact that many trading platforms, 
such as the one from FXCM Forex Capital Markets, provide easy-to-use Python 
wrapper packages for their REST APIs. 


This part of the book comprises three chapters: 


• Chapter 14 introduces the FXCM trading platform, its REST API, and the fxcmpy 
  wrapper package. 

• Chapter 15 focuses on the use of methods from statistics and machine learning to 
  derive algorithmic trading strategies; the chapter also shows how to use 
  vectorized backtesting. 

• Chapter 16 looks at the deployment of automated algorithmic trading strategies; 
  it addresses capital management, backtesting for performance and risk, online 
  algorithms, and deployment. 


CHAPTER 14 


The FXCM Trading Platform 


Financial institutions like to call what they do trading. Let’s be honest. It’s not trading; 
it’s betting. 
—Graydon Carter 


This chapter introduces the trading platform from FXCM Group, LLC (“FXCM” 
hereafter), with its RESTful and streaming application programming interface (API), 
as well as the Python wrapper package fxcmpy. FXCM offers to retail and institu- 
tional traders a number of financial products that can be traded both via traditional 
trading applications and programmatically via the API. The focus of the products lies 
on currency pairs as well as contracts for difference (CFDs) on major stock indices 
and commodities, etc. 


Risk Disclaimer 


Trading forex/CFDs on margin carries a high level of risk and may 
not be suitable for all investors as you could sustain losses in excess 
of deposits. Leverage can work against you. The products are 
intended for retail and professional clients. Due to the certain 
restrictions imposed by the local law and regulation, German resi- 
dent retail client(s) could sustain a total loss of deposited funds but 
are not subject to subsequent payment obligations beyond the 
deposited funds. Be aware and fully understand all risks associated 
with the market and trading. Prior to trading any products, care- 
fully consider your financial situation and experience level. Any 
opinions, news, research, analyses, prices, or other information is 
provided as general market commentary, and does not constitute 
investment advice. The market commentary has not been prepared 
in accordance with legal requirements designed to promote the 
independence of investment research, and it is therefore not subject 
to any prohibition on dealing ahead of dissemination. FXCM 
and the author will not accept liability for any loss or damage, 
including without limitation to, any loss of profit, which may arise 
directly or indirectly from use of or reliance on such information. 


The trading platform of FXCM allows even individual traders with smaller capital 
positions to implement and deploy algorithmic trading strategies. 


This chapter covers the basic functionalities of the FXCM trading API and the 
fxcmpy Python package required to implement an automated algorithmic trading 
strategy programmatically. It is structured as follows: 


“Getting Started” on page 469 
This section shows how to set up everything to work with the FXCM REST API 
for algorithmic trading. 


“Retrieving Data” on page 469 
This section shows how to retrieve and work with financial data (down to the 
tick level). 


“Working with the API” on page 474 
This section illustrates typical tasks implemented using the REST API, such as 
retrieving historical and streaming data, placing orders, and looking up account 
information. 


Getting Started 


Detailed documentation of the FXCM API is found at https://fxcm.github.io/rest-api-docs. 
To install the Python wrapper package fxcmpy, execute this command in the shell: 


pip install fxcmpy 


The documentation for the fxcmpy package is found at http://fxcmpy.tpq.io. 


To get started with the FXCM trading API and the fxcmpy package, a free demo 
account with FXCM is sufficient (see footnote 1). The next step is to create a unique 
API token (say, YOUR_FXCM_API_TOKEN) from within the demo account. A connection to the 
API is then opened, for example, via: 


import fxcmpy 
api = fxcmpy.fxcmpy(access_token=YOUR_FXCM_API_TOKEN, log_level='error') 


Alternatively, a configuration file (say, fxcm.cfg) can be used to connect to the API. 
This file’s contents should look as follows: 

[FXCM] 
log_level = error 
log_file = PATH_TO_AND_NAME_OF_LOG_FILE 
access_token = YOUR_FXCM_API_TOKEN 


One can then connect to the API via: 


import fxcmpy 
api = fxcmpy.fxcmpy(config_file='fxcm.cfg') 
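Before connecting, it can be useful to check that a configuration file has the expected layout. The following sketch uses Python's standard configparser module on a string with the contents described above; it is only an illustration of the file format and makes no claim about how fxcmpy reads the file internally. The variable names and the list of required keys are assumptions.

```python
import configparser

# Contents in the format described in the text; the values are placeholders.
CONFIG_TEXT = """
[FXCM]
log_level = error
log_file = PATH_TO_AND_NAME_OF_LOG_FILE
access_token = YOUR_FXCM_API_TOKEN
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

section = parser['FXCM']
required = ('log_level', 'log_file', 'access_token')  # assumed key list
missing = [key for key in required if key not in section]

print(missing)
print(section['log_level'])
```

For a file on disk, parser.read('fxcm.cfg') would take the place of read_string().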


By default, the fxcmpy class connects to the demo server. However, by the use of the 
server parameter, the connection can be made to the live trading server (if such an 
account exists): 


api = fxcmpy.fxcmpy(config_file='fxcm.cfg', server='demo')  (1) 
api = fxcmpy.fxcmpy(config_file='fxcm.cfg', server='real')  (2) 


(1) Connects to the demo server. 

(2) Connects to the live trading server. 


Retrieving Data 


FXCM provides access to historical market price data sets, such as tick data, in a 
prepackaged variant. This means that one can retrieve, for instance, compressed files 
from FXCM servers that contain tick data for the EUR/USD exchange rate for week 
26 of 2018, as described in the following subsection. The retrieval of historical candles 
data from the API is explained in the subsequent subsection. 


1 Note that FXCM demo accounts are only offered for certain countries. 


Retrieving Tick Data 


For a number of currency pairs, FXCM provides historical tick data. The fxcmpy 
package makes retrieval of such tick data and working with it convenient. First, some 
imports: 


In [1]: import time 
        import numpy as np 
        import pandas as pd 
        import datetime as dt 
        from pylab import mpl, plt 


In [2]: plt.style.use('seaborn') 
        mpl.rcParams['font.family'] = 'serif' 
        %matplotlib inline 


Second, a look at the available symbols (currency pairs) for which tick data is 
available: 


In [3]: from fxcmpy import fxcmpy_tick_data_reader as tdr 


In [4]: print(tdr.get_available_symbols()) 
        ('AUDCAD', 'AUDCHF', 'AUDJPY', 'AUDNZD', 'CADCHF', 'EURAUD', 'EURCHF', 
         'EURGBP', 'EURJPY', 'EURUSD', 'GBPCHF', 'GBPJPY', 'GBPNZD', 'GBPUSD', 
         'GBPCHF', 'GBPJPY', 'GBPNZD', 'NZDCAD', 'NZDCHF', 'NZDJPY', 'NZDUSD', 
         'USDCAD', 'USDCHF', 'USDJPY') 


The following code retrieves one week’s worth of tick data for a single symbol. The 
resulting pandas DataFrame object has more than 1.5 million data rows: 


In [5]: start = dt.datetime(2018, 6, 25)  (1) 
        stop = dt.datetime(2018, 6, 30)  (1) 


In [6]: td = tdr('EURUSD', start, stop)  (1) 


In [7]: td.get_raw_data().info()  (2) 
        <class 'pandas.core.frame.DataFrame'> 
        Index: 1963779 entries, 06/24/2018 21:00:12.290 to 06/29/2018 
         20:59:00.607 
        Data columns (total 2 columns): 
        Bid    float64 
        Ask    float64 
        dtypes: float64(2) 
        memory usage: 44.9+ MB 


In [8]: td.get_data().info()  (3) 
        <class 'pandas.core.frame.DataFrame'> 
        DatetimeIndex: 1963779 entries, 2018-06-24 21:00:12.290000 to 2018-06-29 
         20:59:00.607000 
        Data columns (total 2 columns): 
        Bid    float64 
        Ask    float64 
        dtypes: float64(2) 
        memory usage: 44.9 MB 


In [9]: td.get_data().head() 
Out[9]:                             Bid      Ask 
        2018-06-24 21:00:12.290  1.1662  1.16660 
        2018-06-24 21:00:16.046  1.1662  1.16650 
        2018-06-24 21:00:22.846  1.1662  1.16658 
        2018-06-24 21:00:22.907  1.1662  1.16660 
        2018-06-24 21:00:23.441  1.1662  1.16663 


(1) This retrieves the data file, unpacks it, and stores the raw data in a DataFrame 
object (as an attribute to the resulting object). 

(2) The td.get_raw_data() method returns the DataFrame object with the raw data; 
i.e., with the index values still being str objects. 

(3) The td.get_data() method returns a DataFrame object for which the index has 
been transformed to a DatetimeIndex. 
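The difference between the two index types can be sketched with the standard library alone. The example below parses the first and last raw index strings shown in the output above into datetime objects; the format string is an assumption derived from those two values, and the code does not reflect how fxcmpy performs the conversion.

```python
import datetime as dt

# First and last raw index values as printed by td.get_raw_data().info().
raw_index = ['06/24/2018 21:00:12.290', '06/29/2018 20:59:00.607']

FORMAT = '%m/%d/%Y %H:%M:%S.%f'  # assumed layout of the raw strings

parsed = [dt.datetime.strptime(value, FORMAT) for value in raw_index]

print(parsed[0])
print(parsed[1] - parsed[0])  # time span covered by the data set
```

Only after such a conversion do time-based operations (differences, slicing by time stamps, resampling) become possible, which is why the subsequent examples work with td.get_data().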


Since the tick data is stored in a DataFrame object, it is straightforward to pick a 
subset of the data and to implement typical financial analytics tasks on it. Figure 14-1 
shows a plot of the mid prices derived for the subset and a simple moving average 
(SMA): 


In [10]: sub = td.get_data(start='2018-06-29 12:00:00', 
                           end='2018-06-29 12:15:00')  (1) 


In [11]: sub.head() 
Out[11]:                              Bid      Ask 
         2018-06-29 12:00:00.011  1.16497  1.16498 
         2018-06-29 12:00:00.071  1.16497  1.16497 
         2018-06-29 12:00:00.079  1.16497  1.16498 
         2018-06-29 12:00:00.091  1.16495  1.16498 
         2018-06-29 12:00:00.205  1.16496  1.16498 


In [12]: sub['Mid'] = sub.mean(axis=1)  (2) 


In [13]: sub['SMA'] = sub['Mid'].rolling(1000).mean()  (3) 


In [14]: sub[['Mid', 'SMA']].plot(figsize=(10, 6), lw=0.75); 


(1) Picks a subset of the complete data set. 

(2) Calculates the mid prices from the bid and ask prices. 

(3) Derives SMA values over intervals of 1,000 ticks. 


Figure 14-1. Historical mid tick prices for EUR/USD and SMA 
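The two calculations, mid prices and a simple moving average, can also be written out in plain Python. The sketch below uses the five ticks from the sub.head() output and a window of three ticks (instead of 1,000) purely for illustration; the function names are hypothetical, and None takes the role that NaN plays in pandas while the window is not yet full.

```python
from collections import deque

# (bid, ask) pairs taken from the sub.head() output in the text.
ticks = [(1.16497, 1.16498), (1.16497, 1.16497), (1.16497, 1.16498),
         (1.16495, 1.16498), (1.16496, 1.16498)]


def mid_prices(ticks):
    """Arithmetic mean of bid and ask for every tick."""
    return [(bid + ask) / 2 for bid, ask in ticks]


def rolling_mean(values, window):
    """Returns None entries until the window is full, then window averages."""
    buffer = deque(maxlen=window)
    result = []
    for value in values:
        buffer.append(value)
        full = len(buffer) == window
        result.append(sum(buffer) / window if full else None)
    return result


mids = mid_prices(ticks)
sma = rolling_mean(mids, window=3)  # window of 3 only for illustration
print([round(m, 6) for m in mids])
print(sma[:2])
```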


Retrieving Candles Data 


FXCM also provides access to historical candles data (beyond the API)—i.e., to data 
for certain homogeneous time intervals (“bars”) with open, high, low, and close val- 
ues for both bid and ask prices. 


First, a look at the available symbols for which candles data is provided: 


In [15]: from fxcmpy import fxcmpy_candles_data_reader as cdr 


In [16]: print(cdr.get_available_symbols()) 
         ('AUDCAD', 'AUDCHF', 'AUDJPY', 'AUDNZD', 'CADCHF', 'EURAUD', 'EURCHF', 
          'EURGBP', 'EURJPY', 'EURUSD', 'GBPCHF', 'GBPJPY', 'GBPNZD', 'GBPUSD', 
          'GBPCHF', 'GBPJPY', 'GBPNZD', 'NZDCAD', 'NZDCHF', 'NZDJPY', 'NZDUSD', 
          'USDCAD', 'USDCHF', 'USDJPY') 


Second, the data retrieval itself. It is similar to the tick data retrieval. The only 
difference is that a period value (i.e., the bar length) needs to be specified (e.g., m1 
for one minute, H1 for one hour, or D1 for one day): 


In [17]: start = dt.datetime(2018, 5, 1) 
stop = dt.datetime(2018, 6, 30) 


In [18]: period = 'H1' (1) 


In [19]: candles = cdr('EURUSD', start, stop, period) 


In [20]: data = candles.get_data() 


In [21]: data.info() 
         <class 'pandas.core.frame.DataFrame'> 
         DatetimeIndex: 1080 entries, 2018-04-29 21:00:00 to 2018-06-29 20:00:00 
         Data columns (total 8 columns): 
         BidOpen     1080 non-null float64 
         BidHigh     1080 non-null float64 
         BidLow      1080 non-null float64 
         BidClose    1080 non-null float64 
         AskOpen     1080 non-null float64 
         AskHigh     1080 non-null float64 
         AskLow      1080 non-null float64 
         AskClose    1080 non-null float64 
         dtypes: float64(8) 
         memory usage: 75.9 KB 


In [22]: data[data.columns[:4]].tail()  (2) 
Out[22]:                      BidOpen  BidHigh   BidLow  BidClose 
         2018-06-29 16:00:00  1.16768  1.16820  1.16732   1.16769 
         2018-06-29 17:00:00  1.16769  1.16826  1.16709   1.16781 
         2018-06-29 18:00:00  1.16781  1.16816  1.16668   1.16684 
         2018-06-29 19:00:00  1.16684  1.16792  1.16638   1.16774 
         2018-06-29 20:00:00  1.16774  1.16904  1.16758   1.16816 


In [23]: data[data.columns[4:]].tail()  (3) 
Out[23]:                      AskOpen  AskHigh   AskLow  AskClose 
         2018-06-29 16:00:00  1.16769  1.16820  1.16732   1.16771 
         2018-06-29 17:00:00  1.16771  1.16827  1.16711   1.16782 
         2018-06-29 18:00:00  1.16782  1.16817  1.16669   1.16686 
         2018-06-29 19:00:00  1.16686  1.16794  1.16640   1.16775 
         2018-06-29 20:00:00  1.16775  1.16907  1.16760   1.16861 


(1) Specifies the period value. 

(2) Open, high, low, close values for the bid prices. 

(3) Open, high, low, close values for the ask prices. 


To conclude this section, the following code calculates mid close prices and two 
SMAs, and plots the results (see Figure 14-2): 


In [24]: data['MidClose'] = data[['BidClose', 'AskClose']].mean(axis=1)  (1) 


In [25]: data['SMA1'] = data['MidClose'].rolling(30).mean()  (2) 
         data['SMA2'] = data['MidClose'].rolling(100).mean()  (2) 


In [26]: data[['MidClose', 'SMA1', 'SMA2']].plot(figsize=(10, 6)); 


(1) Calculates the mid close prices from the bid and ask close prices. 

(2) Calculates two SMAs, one for a shorter time interval, one for a longer one. 


Figure 14-2. Historical hourly mid close prices for EUR/USD and two SMAs 


Working with the API 


While the previous sections demonstrate retrieving prepackaged historical tick data 
and candles data from FXCM servers, this section shows how to retrieve historical 
data via the API. For this, a connection object to the FXCM API is needed. Therefore, 
first the import of the fxcmpy package, the connection to the API (based on the 
unique API token), and a look at the available instruments: 


In [27]: import fxcmpy 


In [28]: fxcmpy.__version__ 
Out[28]: '1.1.33' 


In [29]: api = fxcmpy.fxcmpy(config_file='../fxcm.cfg')  (1) 


In [30]: instruments = api.get_instruments() 


In [31]: print(instruments) 
         ['EUR/USD', 'XAU/USD', 'GBP/USD', 'UK100', 'USDOLLAR', 'XAG/USD', 'GER30', 
          'FRA40', 'USD/CNH', 'EUR/JPY', 'USD/JPY', 'CHN50', 'GBP/JPY', 'AUD/JPY', 
          'CHF/JPY', 'USD/CHF', 'GBP/CHF', 'AUD/USD', 'EUR/AUD', 'EUR/CHF', 
          'EUR/CAD', 'EUR/GBP', 'AUD/CAD', 'NZD/USD', 'USD/CAD', 'CAD/JPY', 
          'GBP/AUD', 'NZD/JPY', 'US30', 'GBP/CAD', 'SOYF', 'GBP/NZD', 'AUD/NZD', 
          'USD/SEK', 'EUR/SEK', 'EUR/NOK', 'USD/NOK', 'USD/MXN', 'AUD/CHF', 
          'EUR/NZD', 'USD/ZAR', 'USD/HKD', 'ZAR/JPY', 'BTC/USD', 'USD/TRY', 
          'EUR/TRY', 'NZD/CHF', 'CAD/CHF', 'NZD/CAD', 'TRY/JPY', 'AUS200', 
          'ESP35', 'HKG33', 'JPN225', 'NAS100', 'SPX500', 'Copper', 'EUSTX50', 
          'USOil', 'UKOil', 'NGAS', 'Bund'] 


(1) This connects to the API; adjust the path/filename. 


Retrieving Historical Data 


Once connected, data retrieval for specific time intervals is accomplished via a single 
method call. When using the get_candles() method, the parameter period can be 
one of m1, m5, m15, m30, H1, H2, H3, H4, H6, H8, D1, W1, or M1. The following code gives a 
few examples. Figure 14-3 shows one-minute bar ask close prices for the EUR/USD 
instrument (currency pair): 


In [32]: candles = api.get_candles('USD/JPY', period='D1', number=10)  (1) 


In [33]: candles[candles.columns[:4]]  (1) 
Out[33]:                      bidopen  bidclose  bidhigh   bidlow 
         date 
         2018-10-08 21:00:00  113.760   113.219  113.937  112.816 
         2018-10-09 21:00:00  113.219   112.946  113.386  112.863 
         2018-10-10 21:00:00  112.946   112.267  113.281  112.239 
         2018-10-11 21:00:00  112.267   112.155  112.528  111.825 
         2018-10-12 21:00:00  112.155   112.200  112.491  111.873 
         2018-10-14 21:00:00  112.163   112.130  112.270  112.109 
         2018-10-15 21:00:00  112.130   111.758  112.230  111.619 
         2018-10-16 21:00:00  112.151   112.238  112.333  111.727 
         2018-10-17 21:00:00  112.238   112.636  112.670  112.009 
         2018-10-18 21:00:00  112.636   112.168  112.725  111.942 


In [34]: candles[candles.columns[4:]]  (1) 
Out[34]:                      askopen  askclose  askhigh   asklow  tickqty 
         date 
         2018-10-08 21:00:00  113.840   113.244  113.950  112.827   184835 
         2018-10-09 21:00:00  113.244   112.970  113.399  112.875   321755 
         2018-10-10 21:00:00  112.970   112.287  113.294  112.265   329174 
         2018-10-11 21:00:00  112.287   112.175  112.541  111.835   568231 
         2018-10-12 21:00:00  112.175   112.243  112.504  111.885   363233 
         2018-10-14 21:00:00  112.219   112.181  112.294  112.145      581 
         2018-10-15 21:00:00  112.181   111.781  112.243  111.631   322304 
         2018-10-16 21:00:00  112.163   112.271  112.345  111.740   253420 
         2018-10-17 21:00:00  112.271   112.664  112.682  112.022   542166 
         2018-10-18 21:00:00  112.664   112.237  112.738  111.955   369012 


In [35]: start = dt.datetime(2017, 1, 1)  (2) 
         end = dt.datetime(2018, 1, 1)  (2) 


In [36]: candles = api.get_candles('EUR/GBP', period='D1', 
                                   start=start, stop=end)  (2) 


In [37]: candles.info()  (2) 
         <class 'pandas.core.frame.DataFrame'> 
         DatetimeIndex: 309 entries, 2017-01-03 22:00:00 to 2018-01-01 22:00:00 
         Data columns (total 9 columns): 
         bidopen     309 non-null float64 
         bidclose    309 non-null float64 
         bidhigh     309 non-null float64 
         bidlow      309 non-null float64 
         askopen     309 non-null float64 
         askclose    309 non-null float64 
         askhigh     309 non-null float64 
         asklow      309 non-null float64 
         tickqty     309 non-null int64 
         dtypes: float64(8), int64(1) 
         memory usage: 24.1 KB 


In [38]: candles = api.get_candles('EUR/USD', period='m1', number=250)  (3) 


In [39]: candles['askclose'].plot(figsize=(10, 6)) 


(1) Retrieves the 10 most recent end-of-day prices. 

(2) Retrieves end-of-day prices for a whole year. 

(3) Retrieves the most recent one-minute bar prices available. 


Figure 14-3. Historical ask close prices for EUR/USD (minute bars) 
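When choosing a period value, it helps to have a rough idea of how many bars a request can return. The following sketch maps the fixed-length period codes listed above to timedelta objects and derives an upper bound for the number of bars in a time interval. The helper and the dictionary are hypothetical and not part of fxcmpy; W1 is taken as seven days, and M1 (one month) is left out on purpose because it has no fixed length. Note that the codes are case sensitive: m1 stands for one minute, M1 for one month.

```python
import datetime as dt

# Fixed-length period codes from the text (assumed mapping).
PERIOD_LENGTH = {
    'm1': dt.timedelta(minutes=1), 'm5': dt.timedelta(minutes=5),
    'm15': dt.timedelta(minutes=15), 'm30': dt.timedelta(minutes=30),
    'H1': dt.timedelta(hours=1), 'H2': dt.timedelta(hours=2),
    'H3': dt.timedelta(hours=3), 'H4': dt.timedelta(hours=4),
    'H6': dt.timedelta(hours=6), 'H8': dt.timedelta(hours=8),
    'D1': dt.timedelta(days=1), 'W1': dt.timedelta(days=7),
}


def max_bars(start, stop, period):
    """Upper bound for the number of bars between start and stop."""
    if period not in PERIOD_LENGTH:
        raise ValueError('unsupported period: %s' % period)
    return (stop - start) // PERIOD_LENGTH[period]


print(max_bars(dt.datetime(2018, 5, 1), dt.datetime(2018, 6, 30), 'H1'))
```

The actual number of bars is lower than this bound because markets are closed over the weekend; the hourly candles data set retrieved earlier for the same two months, for example, has 1,080 entries.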


Retrieving Streaming Data 


While historical data is important to, for example, backtest algorithmic trading 
strategies, continuous access to real-time or streaming data (during trading hours) is 
required to deploy and automate algorithmic trading strategies. The FXCM API 
allows for the subscription to real-time data streams for all instruments. The fxcmpy 
wrapper package supports this functionality, among others, in that it allows users to 
provide user-defined functions (so-called callback functions) to process the real-time 
data stream. 


The following code presents a simple callback function—it only prints out selected 
elements of the data set retrieved—and uses it to process data retrieved in real time 


after subscribing to the desired instrument (here, EUR/USD): 


In [40]: def output(data, dataframe):
             print('%3d | %s | %s | %6.5f, %6.5f'
                   % (len(dataframe), data['Symbol'],
                      pd.to_datetime(int(data['Updated']), unit='ms'),
                      data['Rates'][0], data['Rates'][1]))  (1)

In [41]: api.subscribe_market_data('EUR/USD', (output,))  (2)
           1 | EUR/USD | 2018-10-19 11:36:39.735000 | 1.14694, 1.14705
           2 | EUR/USD | 2018-10-19 11:36:39.776000 | 1.14694, 1.14706
           3 | EUR/USD | 2018-10-19 11:36:40.714000 | 1.14695, 1.14707
           4 | EUR/USD | 2018-10-19 11:36:41.646000 | 1.14696, 1.14708
           5 | EUR/USD | 2018-10-19 11:36:41.992000 | 1.14696, 1.14709
           6 | EUR/USD | 2018-10-19 11:36:45.131000 | 1.14696, 1.14708
           7 | EUR/USD | 2018-10-19 11:36:45.247000 | 1.14696, 1.14709

In [42]: api.get_last_price('EUR/USD')  (3)
Out[42]: Bid     1.14696
         Ask     1.14709
         High    1.14775
         Low     1.14323
         Name: 2018-10-19 11:36:45.247000, dtype: float64

In [43]: api.unsubscribe_market_data('EUR/USD')  (4)
           8 | EUR/USD | 2018-10-19 11:36:48.239000 | 1.14696, 1.14708

(1) The callback function that prints out certain elements of the retrieved data set.

(2) The subscription to a specific real-time data stream; data is processed
    asynchronously as long as there is no “unsubscribe” event.

(3) During the subscription, the .get_last_price() method returns the last available
    data set.

(4) This unsubscribes from the real-time data stream.


Callback Functions


Callback functions are a flexible means to process real-time
streaming data based on a Python function or even multiple such
functions. They can be used for simple tasks, such as the printing
of incoming data, or complex tasks, such as generating trading
signals based on online trading algorithms (see Chapter 16).
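
As a minimal sketch of the pattern, the following callback keeps a running list of mid prices and derives a naive momentum signal from it. It assumes the callback signature (data, dataframe) and the data['Rates'] layout (bid first, ask second) used by the output() function above. The names make_signal_callback, on_tick, mids, signals, and window are hypothetical, and the simulated tick dictionaries stand in for a live data stream.

```python
def make_signal_callback(window=3):
    """Return a callback plus the state it updates (illustrative helper)."""
    mids = []      # running mid prices
    signals = []   # +1 (up), -1 (down), 0 (not enough data or unchanged)

    def on_tick(data, dataframe):
        # assumption: Rates[0] is the bid and Rates[1] is the ask
        bid, ask = data['Rates'][0], data['Rates'][1]
        mids.append((bid + ask) / 2)
        if len(mids) <= window:
            signals.append(0)
            return
        # compare the latest mid price with the one `window` ticks earlier
        change = mids[-1] - mids[-1 - window]
        signals.append(1 if change > 0 else -1 if change < 0 else 0)

    return on_tick, mids, signals


# simulated ticks in place of a live subscription
on_tick, mids, signals = make_signal_callback(window=2)
for bid in (1.1469, 1.1470, 1.1472, 1.1471, 1.1468):
    tick = {'Symbol': 'EUR/USD', 'Rates': [bid, bid + 0.0001]}
    on_tick(tick, None)
print(signals)
```

With a live connection, a function like on_tick could be passed in the tuple handed to subscribe_market_data(), in the same way as output().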


Placing Orders 


The FXCM API allows the placement and management of all types of orders that are
also available via the trading application of FXCM (such as entry orders or trailing
stop loss orders).² However, the following code illustrates basic market buy and sell
orders only since they are in general sufficient to at least get started with algorithmic
trading. It first verifies that there are no open positions, then opens different
positions (via the create_market_buy_order() method):


In [44]: api.get_open_positions()  (1)
Out[44]: Empty DataFrame
         Columns: []
         Index: []

In [45]: order = api.create_market_buy_order('EUR/USD', 10)  (2)

In [46]: sel = ['tradeId', 'amountK', 'currency',
                'grossPL', 'isBuy']

In [47]: api.get_open_positions()[sel]  (3)
Out[47]:      tradeId  amountK currency  grossPL  isBuy
         0  132607899       10  EUR/USD  0.17436   True

In [48]: order = api.create_market_buy_order('EUR/GBP', 5)  (4)

In [49]: api.get_open_positions()[sel]
Out[49]:      tradeId  amountK currency  grossPL  isBuy
         0  132607899       10  EUR/USD  0.17436   True
         1  132607928        5  EUR/GBP -1.53367   True

(1) Shows the open positions for the connected (default) account.

(2) Opens a position of 10,000 in the EUR/USD currency pair.³

(3) Shows the open positions for selected elements only.

(4) Opens another position of 5,000 in the EUR/GBP currency pair.


2 See the documentation for details.


3 Quantities are in thousands of the instrument for currency pairs. Also note that different accounts might have
different leverage ratios. This implies that the same position might require more or less equity (margin)
depending on the relevant leverage ratio. Adjust the example quantities to lower values if necessary.


While the create_market_buy_order() function opens or increases positions, the 
create_market_sell_order() function allows one to close or decrease positions. 
There are also more general methods that allow the closing out of positions, as the 
following code illustrates: 


In [50]: order = api.create_market_sell_order('EUR/USD', 3)  (1)

In [51]: order = api.create_market_buy_order('EUR/GBP', 5)  (2)

In [52]: api.get_open_positions()[sel]  (3)
Out[52]:      tradeId  amountK currency  grossPL  isBuy
         0  132607899       10  EUR/USD  0.17436   True
         1  132607928        5  EUR/GBP -1.53367   True
         2  132607930        3  EUR/USD -1.33369  False
         3  132607932        5  EUR/GBP -1.64728   True

In [53]: api.close_all_for_symbol('EUR/GBP')  (4)

In [54]: api.get_open_positions()[sel]
Out[54]:      tradeId  amountK currency  grossPL  isBuy
         0  132607899       10  EUR/USD  0.17436   True
         1  132607930        3  EUR/USD -1.33369  False

In [55]: api.close_all()  (5)

In [56]: api.get_open_positions()
Out[56]: Empty DataFrame
         Columns: []
         Index: []

(1) This reduces the position in the EUR/USD currency pair.

(2) This increases the position in the EUR/GBP currency pair.

(3) For EUR/GBP there are now two open long positions; contrary to the EUR/USD
    position, they are not netted.

(4) The close_all_for_symbol() method closes all positions for the specified
    symbol.

(5) The close_all() method closes all open positions.


Account Information


Beyond, for example, open positions, the FXCM API allows retrieval of more general
account information as well. For example, one can look up the default account (if
there are multiple accounts) or get an overview of the equity and margin situation:


In [57]: api.get_default_account()  (1)
Out[57]: 1090495 


In [58]: api.get_accounts().T (2) 


Out[58]: 0 
accountId 1090495 
accountName 01090495 
balance 4915.2 
dayPL -41.97 
equity 4915.2 
grossPL 0 
hedging Y 
mc N 
mcDate 
ratePrecision 0 
t 6 
usableMargin 4915.2 
usableMargin3 4915.2 
usableMargin3Perc 100 
usableMarginPerc 100 
usdMr 0 
usdMr3 0 


(1) Shows the default accountId value.


(2) Shows for all accounts the financial situation and some parameters.


Conclusion 
This chapter is about the REST API of FXCM for algorithmic trading and covers the
following topics:


• Setting everything up for API usage
• Retrieving historical tick data
• Retrieving historical candles data
• Retrieving streaming data in real time
• Placing market buy and sell orders
• Looking up account information


The FXCM API and the fxcmpy wrapper package provide, of course, more functionality,
but these are the basic building blocks needed to get started with algorithmic
trading.


Further Resources 


For further details on the FXCM trading API and the Python wrapper package, consult
the documentation:


• Trading API
• fxcmpy package


For a comprehensive online training program covering Python for algorithmic trading,
see http://certificate.tpq.io.


CHAPTER 15
Trading Strategies


[T]hey were silly enough to think you can look at the past to predict the future.


—The Economist¹


This chapter is about the vectorized backtesting of algorithmic trading strategies. The
term algorithmic trading strategy is used to describe any type of financial trading
strategy that is based on an algorithm designed to take long, short, or neutral
positions in financial instruments on its own without human interference. A simple
algorithm, such as “alternating every five minutes between a long and a neutral position
in the stock of Apple, Inc.,” satisfies this definition. For the purposes of this chapter
and a bit more technically, an algorithmic trading strategy is represented by some Python
code that, given the availability of new data, decides whether to buy or sell a financial
instrument in order to take long, short, or neutral positions in it.


The chapter does not provide an overview of algorithmic trading strategies (see “Further
Resources” for references that cover algorithmic trading strategies in more detail).
Rather, it focuses on the technical aspects of the vectorized backtesting approach for a
select few such strategies. With this approach, the financial data on which the strategy
is tested is in general manipulated as a whole, applying vectorized operations on NumPy
ndarray and pandas DataFrame objects that store the financial data.²


Another focus of the chapter is the application of machine and deep learning algorithms
to formulate algorithmic trading strategies. To this end, classification algorithms are
trained on historical data in order to predict future directional market movements. This
in general requires the transformation of the financial data from real values to a
relatively small number of categorical values.³ This allows us to harness the pattern
recognition power of such algorithms.


1 Source: “Does the Past Predict the Future?” Economist.com, 23 September 2009, available at
https://www.economist.com/free-exchange/2009/09/23/does-the-past-predict-the-future.


2 An alternative approach would be the event-based backtesting of trading strategies, during which the arrival of
new data in markets is simulated by explicitly looping over every single new data point.


The chapter is broken down into the following sections: 


“Simple Moving Averages”
This section focuses on an algorithmic trading strategy based on simple moving
averages and how to backtest such a strategy.


“Random Walk Hypothesis”
This section introduces the random walk hypothesis.


“Linear OLS Regression”
This section looks at using OLS regression to derive an algorithmic trading
strategy.


“Clustering”
In this section, we explore using unsupervised learning algorithms to derive
algorithmic trading strategies.


“Frequency Approach”
This section introduces a simple frequentist approach for algorithmic trading.


“Classification”
Here we look at classification algorithms from machine learning for algorithmic
trading.


“Deep Neural Networks”
This section focuses on deep neural networks and how to use them for algorithmic
trading.


Simple Moving Averages 


Trading based on simple moving averages (SMAs) is a decades-old trading approach 
(see, for example, the paper by Brock et al. (1992)). Although many traders use SMAs 
for their discretionary trading, they can also be used to formulate simple algorithmic 
trading strategies. This section uses SMAs to introduce vectorized backtesting of 
algorithmic trading strategies. It builds on the technical analysis example in
Chapter 8.


3 Note that when working with real values, every pattern might be unique or at least rather rare, which makes it
difficult to train an algorithm and to conclude anything from an observed pattern.


Data Import


First, some imports: 


In [1]: import numpy as np
        import pandas as pd
        import datetime as dt
        from pylab import mpl, plt

In [2]: plt.style.use('seaborn')
        mpl.rcParams['font.family'] = 'serif'
        %matplotlib inline


Second, the reading of the raw data and the selection of the financial time series for a
single symbol, the stock of Apple, Inc. (AAPL.O). The analysis in this section is based
on end-of-day data; intraday data is used in subsequent sections:


In [3]: raw = pd.read_csv('../../source/tr_eikon_eod_data.csv',
                          index_col=0, parse_dates=True)

In [4]: raw.info()
        <class 'pandas.core.frame.DataFrame'>
        DatetimeIndex: 2216 entries, 2010-01-01 to 2018-06-29
        Data columns (total 12 columns):
        AAPL.O    2138 non-null float64
        MSFT.O    2138 non-null float64
        INTC.O    2138 non-null float64
        AMZN.O    2138 non-null float64
        GS.N      2138 non-null float64
        SPY       2138 non-null float64
        .SPX      2138 non-null float64
        .VIX      2138 non-null float64
        EUR=      2216 non-null float64
        XAU=      2211 non-null float64
        GDX       2138 non-null float64
        GLD       2138 non-null float64
        dtypes: float64(12)
        memory usage: 225.1 KB

In [5]: symbol = 'AAPL.O'

In [6]: data = (
            pd.DataFrame(raw[symbol])
            .dropna()
        )

Third, the calculation of the SMA values for two different rolling window sizes.
Figure 15-1 shows the three time series visually:


In [7]: SMA1 = 42  (1)
        SMA2 = 252  (2)

In [8]: data['SMA1'] = data[symbol].rolling(SMA1).mean()  (1)
        data['SMA2'] = data[symbol].rolling(SMA2).mean()  (2)

In [9]: data.plot(figsize=(10, 6));

(1) Calculates the values for the shorter SMA.

(2) Calculates the values for the longer SMA.

Figure 15-1. Apple stock price and two simple moving averages 


Fourth, the derivation of the positions. The trading rules are: 


• Go long (= +1) when the shorter SMA is above the longer SMA.


• Go short (= -1) when the shorter SMA is below the longer SMA.⁴


The positions are visualized in Figure 15-2: 


In [10]: data.dropna(inplace=True)

In [11]: data['Position'] = np.where(data['SMA1'] > data['SMA2'], 1, -1)  (1)

In [12]: data.tail()
Out[12]:             AAPL.O        SMA1        SMA2  Position
         Date
         2018-06-25  182.17  185.606190  168.265556         1
         2018-06-26  184.43  186.087381  168.418770         1
         2018-06-27  184.16  186.607381  168.579206         1
         2018-06-28  185.50  187.089286  168.736627         1
         2018-06-29  185.11  187.470476  168.901032         1

In [13]: ax = data.plot(secondary_y='Position', figsize=(10, 6))
         ax.get_legend().set_bbox_to_anchor((0.25, 0.85));

(1) np.where(cond, a, b) evaluates the condition cond element-wise and places a
    when True and b otherwise.


4 Similarly, for a long only strategy one would use +1 for a long position and 0 for a neutral position.


Figure 15-2. Apple stock price, two SMAs, and resulting positions 


This replicates the results derived in Chapter 8. What is not addressed there is whether
following the trading rules (i.e., implementing the algorithmic trading strategy) is
superior to the benchmark case of simply going long on the Apple stock over the whole
period. Given that the strategy leads to only two periods during which the Apple stock
should be shorted, differences in the performance can only result from these two
periods.


Vectorized Backtesting 


The vectorized backtesting can now be implemented as follows. First, the log returns
are calculated. Then the positions, represented as +1 or -1, are multiplied by the
relevant log return. This simple calculation is possible since a long position earns the
return of the Apple stock and a short position earns the negative return of the Apple
stock. Finally, the log returns for the Apple stock and the algorithmic trading strategy
based on SMAs need to be added up and the exponential function applied to arrive at
the performance values:


In [14]: data['Returns'] = np.log(data[symbol] / data[symbol].shift(1))  (1)

In [15]: data['Strategy'] = data['Position'].shift(1) * data['Returns']  (2)

In [16]: data.round(4).head()
Out[16]:              AAPL.O     SMA1     SMA2  Position  Returns  Strategy
         Date
         2010-12-31  46.0800  45.2810  37.1207         1      NaN       NaN
         2011-01-03  47.0814  45.3497  37.1862         1   0.0215    0.0215
         2011-01-04  47.3271  45.4126  37.2525         1   0.0052    0.0052
         2011-01-05  47.7142  45.4661  37.3223         1   0.0081    0.0081
         2011-01-06  47.6757  45.5226  37.3921         1  -0.0008   -0.0008

In [17]: data.dropna(inplace=True)

In [18]: np.exp(data[['Returns', 'Strategy']].sum())  (3)
Out[18]: Returns     4.017148
         Strategy    5.811299
         dtype: float64

In [19]: data[['Returns', 'Strategy']].std() * 252 ** 0.5  (4)
Out[19]: Returns     0.250571
         Strategy    0.250407
         dtype: float64

(1) Calculates the log returns of the Apple stock (i.e., the benchmark investment).

(2) Multiplies the position values, shifted by one day, by the log returns of the Apple
    stock; the shift is required to avoid a foresight bias.⁵

(3) Sums up the log returns for the strategy and the benchmark investment and
    calculates the exponential value to arrive at the absolute performance.

(4) Calculates the annualized volatility for the strategy and the benchmark investment.


The numbers show that the algorithmic trading strategy indeed outperforms the 
benchmark investment of passively holding the Apple stock. Due to the type and 
characteristics of the strategy, the annualized volatility is the same, such that it also 
outperforms the benchmark investment on a risk-adjusted basis. 
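
The mechanics can be checked on a toy series. The sketch below uses made-up prices and positions (not the Apple data; the names toy, gross, and expected are hypothetical) to show that summing log returns and applying the exponential function recovers the gross performance, and that shifting the position by one day makes each position earn the following day's return.

```python
import numpy as np
import pandas as pd

# hypothetical prices and positions, for illustration only
toy = pd.DataFrame({'price': [100.0, 102.0, 101.0, 104.0],
                    'position': [1, -1, -1, 1]})

toy['returns'] = np.log(toy['price'] / toy['price'].shift(1))
# the position decided on day t earns the return of day t + 1
toy['strategy'] = toy['position'].shift(1) * toy['returns']

# sum() skips the leading NaN values by default
gross = np.exp(toy[['returns', 'strategy']].sum())

# the sum of log returns telescopes to log(P_T / P_0)
assert np.isclose(gross['returns'], 104.0 / 100.0)
# long over the first return, short over the second and third
expected = (102 / 100) * (102 / 101) * (101 / 104)
assert np.isclose(gross['strategy'], expected)
print(gross)
```

Without the shift, the position of a given day would be multiplied by the return of that same day, which the algorithm could not have known when setting up the position.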


5 The basic idea is that the algorithm can only set up a position in the Apple stock given today’s market data
(e.g., just before the close). The position then earns tomorrow’s return.


To gain a better picture of the overall performance, Figure 15-3 shows the performance
of the Apple stock and the algorithmic trading strategy over time:


In [20]: ax = data[['Returns', 'Strategy']].cumsum(
                 ).apply(np.exp).plot(figsize=(10, 6))
         data['Position'].plot(ax=ax, secondary_y='Position', style='--')
         ax.get_legend().set_bbox_to_anchor((0.25, 0.85));

Figure 15-3. Performance of Apple stock and SMA-based trading strategy over time


Simplifications 


The vectorized backtesting approach as introduced in this subsection
is based on a number of simplifying assumptions. Among others,
transaction costs (fixed fees, bid-ask spreads, lending costs,
etc.) are not included. This might be justifiable for a trading
strategy that leads to only a few trades over multiple years. It is also
assumed that all trades take place at the end-of-day closing prices
for the Apple stock. A more realistic backtesting approach would
take these and other (market microstructure) elements into
account.
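
One way to relax the zero-cost assumption in a vectorized backtest is to subtract a proportional cost from the strategy's log return whenever the position changes. The following is a minimal sketch under stated assumptions: the function name apply_transaction_costs, the parameter tc, and the toy data are hypothetical; costs are proportional to the traded amount (a switch from +1 to -1 trades two units); and the cost is booked on the first day on which the new position earns a return.

```python
import numpy as np
import pandas as pd


def apply_transaction_costs(position, returns, tc=0.001):
    """Strategy log returns net of proportional costs (illustrative)."""
    held = position.shift(1)            # position that earns today's return
    gross = held * returns
    # units traded to arrive at the held position
    # (the very first entry into the market is not charged in this sketch)
    traded = held.diff().abs().fillna(0)
    return gross + traded * np.log(1 - tc)


position = pd.Series([1, 1, -1, -1, 1])
returns = pd.Series([np.nan, 0.01, -0.02, 0.01, 0.03])
net = apply_transaction_costs(position, returns, tc=0.001)
print(net)
```

For a strategy with only a handful of position changes the adjustment is small, while for strategies that trade frequently it can dominate the result.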


Optimization 


A natural question that arises is whether the chosen parameters SMA1=42 and SMA2=252 are
the “right” ones. In general, investors prefer higher returns to lower returns ceteris
paribus. Therefore, one might be inclined to search for those parameters that maximize
the return over the relevant period. To this end, a brute force approach can be
used that simply repeats the whole vectorized backtesting procedure for different
parameter combinations, records the results, and does a ranking afterward. This is
what the following code does:


In [21]: from itertools import product

In [22]: sma1 = range(20, 61, 4)  (1)
         sma2 = range(180, 281, 10)  (2)

In [23]: rows = []
         for SMA1, SMA2 in product(sma1, sma2):  (3)
             data = pd.DataFrame(raw[symbol])
             data.dropna(inplace=True)
             data['Returns'] = np.log(data[symbol] / data[symbol].shift(1))
             data['SMA1'] = data[symbol].rolling(SMA1).mean()
             data['SMA2'] = data[symbol].rolling(SMA2).mean()
             data.dropna(inplace=True)
             data['Position'] = np.where(data['SMA1'] > data['SMA2'], 1, -1)
             data['Strategy'] = data['Position'].shift(1) * data['Returns']
             data.dropna(inplace=True)
             perf = np.exp(data[['Returns', 'Strategy']].sum())
             # DataFrame.append() was removed in pandas 2.0; rows are collected
             # in a list and converted to a DataFrame after the loop instead
             rows.append({'SMA1': SMA1, 'SMA2': SMA2,
                          'MARKET': perf['Returns'],
                          'STRATEGY': perf['Strategy'],
                          'OUT': perf['Strategy'] - perf['Returns']})  (4)
         results = pd.DataFrame(rows)

(1) Specifies the parameter values for SMA1.

(2) Specifies the parameter values for SMA2.

(3) Combines all values for SMA1 with those for SMA2.

(4) Records the vectorized backtesting results in a DataFrame object.


The following code gives an overview of the results and shows the seven best-performing
parameter combinations of all those backtested. The ranking is implemented
according to the outperformance of the algorithmic trading strategy
compared to the benchmark investment. The performance of the benchmark investment
varies since the choice of the SMA2 parameter influences the length of the time
interval and data set on which the vectorized backtest is implemented:


In [24]: results.info()
         <class 'pandas.core.frame.DataFrame'>
         RangeIndex: 121 entries, 0 to 120
         Data columns (total 5 columns):
         SMA1        121 non-null int64
         SMA2        121 non-null int64
         MARKET      121 non-null float64
         STRATEGY    121 non-null float64
         OUT         121 non-null float64
         dtypes: float64(3), int64(2)
         memory usage: 4.8 KB


In [25]: results.sort_values('OUT', ascending=False).head(7) 
Out[25]: SMA1 SMA2 MARKET STRATEGY OUT 


56 40 190 4.650342 7.175173 2.524831 
39 32 240 4.045619 6.558690 2.513071 
59 40 220 4.220272 6.544266 2.323994 
46 36 200 4.074753 6.389627 2.314874 
55 40 180 4.574979 6.857989 2.283010 
70 44 220 4.220272 6.469843 2.249571 
101 56 200 4.074753 6.319524 2.244772 


According to the brute force-based optimization, SMA1=40 and SMA2=190 are the optimal
parameters, leading to an outperformance of some 250 percentage points. However,
this result is heavily dependent on the data set used and is prone to overfitting.
A more rigorous approach would be to implement the optimization on one data set,
the in-sample or training data set, and test it on another one, the out-of-sample or
testing data set.
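
A minimal sketch of such a split follows. It assumes simulated prices in place of the Apple data; the helper sma_performance, the split point, the seed, and the small parameter grid are hypothetical choices made for illustration. The parameters are selected on the first part of the data only and then evaluated once on the remaining part.

```python
from itertools import product

import numpy as np
import pandas as pd


def sma_performance(prices, sma1, sma2):
    """Gross performance of the SMA crossover strategy on a price series."""
    df = pd.DataFrame({'price': prices})
    df['returns'] = np.log(df['price'] / df['price'].shift(1))
    df['SMA1'] = df['price'].rolling(sma1).mean()
    df['SMA2'] = df['price'].rolling(sma2).mean()
    df.dropna(inplace=True)
    df['position'] = np.where(df['SMA1'] > df['SMA2'], 1, -1)
    df['strategy'] = df['position'].shift(1) * df['returns']
    return np.exp(df['strategy'].sum())


# simulated geometric random walk as stand-in data
rng = np.random.default_rng(100)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0.0003, 0.01, 2000))))

split = 1400
train, test = prices.iloc[:split], prices.iloc[split:]

# optimize on the training data only
grid = list(product(range(20, 61, 20), range(100, 201, 50)))
best = max(grid, key=lambda p: sma_performance(train, *p))

print('best in-sample parameters:', best)
print('in-sample performance:    ', sma_performance(train, *best))
print('out-of-sample performance:', sma_performance(test, *best))
```

The out-of-sample figure is the more honest estimate of the two; a large gap between the in-sample and out-of-sample numbers is a typical symptom of overfitting.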


Overfitting 


In general, any type of optimization, fitting, or training in the context
of algorithmic trading strategies is prone to what is called overfitting.
This means that parameters might be chosen that perform
(exceptionally) well for the used data set but might perform
(exceptionally) badly on other data sets or in practice.


Random Walk Hypothesis 


The previous section introduces vectorized backtesting as an efficient tool to backtest
algorithmic trading strategies. The single strategy backtested based on a single financial
time series, namely historical end-of-day prices for the Apple stock, outperforms
the benchmark investment of simply going long on the Apple stock over the same
period.


Although rather specific in nature, these results are in contrast to what the random
walk hypothesis (RWH) predicts, namely that such predictive approaches should not
yield any outperformance at all. The RWH postulates that prices in financial markets
follow a random walk, or, in continuous time, an arithmetic Brownian motion
without drift. The expected value of an arithmetic Brownian motion without drift at
any point in the future equals its value today.⁶ As a consequence, the best predictor
for tomorrow’s price, in a least-squares sense, is today’s price if the RWH applies.
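
In formulas, and as a sketch under the assumption of a constant volatility $\sigma$ and a standard Brownian motion $W$ (symbols introduced here for illustration only), the two statements read:

```latex
S_t = S_0 + \sigma W_t, \quad W_t \sim N(0, t)
\quad \Longrightarrow \quad
\mathbf{E}_0\left[S_t\right] = S_0 + \sigma \, \mathbf{E}_0\left[W_t\right] = S_0

\arg\min_{c} \; \mathbf{E}_t\left[\left(S_{t+1} - c\right)^2\right]
= \mathbf{E}_t\left[S_{t+1}\right] = S_t
```

The first line states that the process has no expected change; the second uses the fact that the conditional expectation minimizes the mean squared prediction error.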


The consequences are summarized in the following quote: 


For many years, economists, statisticians, and teachers of finance have been interested 
in developing and testing models of stock price behavior. One important model that 
has evolved from this research is the theory of random walks. This theory casts serious 
doubt on many other methods for describing and predicting stock price behavior— 
methods that have considerable popularity outside the academic world. For example, 
we shall see later that, if the random-walk theory is an accurate description of reality, 
then the various “technical” or “chartist” procedures for predicting stock prices are 
completely without value. 


—Eugene F. Fama (1965) 


The RWH is consistent with the efficient markets hypothesis (EMH), which,
non-technically speaking, states that market prices reflect “all available information.”
Different degrees of efficiency are generally distinguished, such as weak, semi-strong, and
strong, defining more specifically what “all available information” entails. Formally,
such a definition can be based on the concept of an information set in theory and on
a data set for programming purposes, as the following quote illustrates:


A market is efficient with respect to an information set S if it is impossible to make 
economic profits by trading on the basis of information set S. 


—Michael Jensen (1978) 


Using Python, the RWH can be tested for a specific case as follows. A financial time 
series of historical market prices is used for which a number of lagged versions are 
created—say, five. OLS regression is then used to predict the market prices based on 
the lagged market prices created before. The basic idea is that the market prices from 
yesterday and four more days back can be used to predict today’s market price. 


The following Python code implements this idea and creates five lagged versions of 
the historical end-of-day closing levels of the S&P 500 stock index: 


In [26]: symbol = '.SPX'

In [27]: data = pd.DataFrame(raw[symbol])

In [28]: lags = 5
         cols = []
         for lag in range(1, lags + 1):
             col = 'lag_{}'.format(lag)  (1)
             data[col] = data[symbol].shift(lag)  (2)
             cols.append(col)  (3)

In [29]: data.head(7)
Out[29]:                .SPX    lag_1    lag_2    lag_3    lag_4    lag_5
         Date
         2010-01-01      NaN      NaN      NaN      NaN      NaN      NaN
         2010-01-04  1132.99      NaN      NaN      NaN      NaN      NaN
         2010-01-05  1136.52  1132.99      NaN      NaN      NaN      NaN
         2010-01-06  1137.14  1136.52  1132.99      NaN      NaN      NaN
         2010-01-07  1141.69  1137.14  1136.52  1132.99      NaN      NaN
         2010-01-08  1144.98  1141.69  1137.14  1136.52  1132.99      NaN
         2010-01-11  1146.98  1144.98  1141.69  1137.14  1136.52  1132.99

In [30]: data.dropna(inplace=True)

(1) Defines a column name for the current lag value.

(2) Creates the lagged version of the market prices for the current lag value.

(3) Collects the column names for later reference.


6 For a formal definition and deeper discussion of random walks and Brownian motion-based processes, refer
to Baxter and Rennie (1996).


Using NumPy, the OLS regression is straightforward to implement. As the optimal
regression parameters show, lag_1 indeed is the most important one in predicting
the market price based on OLS regression. Its value is close to 1. The other four
values are rather close to 0. Figure 15-4 visualizes the optimal regression parameter
values.


Figure 15-4. Optimal regression parameters from OLS regression for price prediction


When using the optimal results to visualize the prediction values as compared to the
original index values for the S&P 500, it becomes obvious from Figure 15-5 that
indeed lag_1 is basically what is used to come up with the prediction value. Graphically
speaking, the prediction line in Figure 15-5 is the original time series shifted by
one day to the right (with some minor adjustments).


Figure 15-5. S&P 500 levels compared to prediction values from OLS regression
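
The regression described above can be sketched in a few lines. The example below uses a simulated random walk instead of the S&P 500 data so that it runs on its own; the seed, the number of observations, the column name price, and the use of np.linalg.lstsq without an intercept are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# simulated random walk as stand-in for the index levels
rng = np.random.default_rng(42)
data = pd.DataFrame({'price': 1000 + np.cumsum(rng.standard_normal(2000))})

lags = 5
cols = []
for lag in range(1, lags + 1):
    col = 'lag_{}'.format(lag)
    data[col] = data['price'].shift(lag)
    cols.append(col)
data.dropna(inplace=True)

# least-squares solution of price ~ lag_1, ..., lag_5 (no intercept)
reg = np.linalg.lstsq(data[cols].values, data['price'].values, rcond=None)[0]
print(dict(zip(cols, reg.round(3))))

# prediction values implied by the fitted parameters
data['prediction'] = data[cols].values @ reg
```

For a random walk, the parameter for lag_1 comes out close to 1 and the remaining parameters close to 0, which mirrors the pattern reported for the index data.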


All in all, the brief analysis in this section reveals some support for both the RWH
and the EMH. Admittedly, the analysis covers a single stock index only and uses a
rather specific parameterization, but it can easily be widened to incorporate multiple
financial instruments across multiple asset classes, different values for the number
of lags, etc. In general, one finds that the results are qualitatively more or
less the same. After all, the RWH and EMH are among the financial theories that
have broad empirical support. In that sense, any algorithmic trading strategy must
prove its worth by showing that the RWH does not apply in general. This is certainly a
tough hurdle.


Linear OLS Regression 


This section applies linear OLS regression to predict the direction of market movements
based on historical log returns. To keep things simple, only two features are
used. The first feature (lag_1) represents the log returns of the financial time series
lagged by one day. The second feature (lag_2) lags the log returns by two days. Log
returns, in contrast to prices, are stationary in general, which often is a necessary
condition for the application of statistical and ML algorithms.


The basic idea behind the usage of lagged log returns as features is that they might be
informative in predicting future returns. For example, one might hypothesize that
after two downward movements an upward movement is more likely (“mean reversion”),
or, to the contrary, that another downward movement is more likely
(“momentum” or “trend”). The application of regression techniques allows the
formalization of such informal reasoning.


The Data 


First, the importing and preparation of the data set. Figure 15-6 shows the frequency 
distribution of the daily historical log returns for the EUR/USD exchange rate. They 
are the basis for the features as well as the labels to be used in what follows: 


In [3]: raw = pd.read_csv('../../source/tr_eikon_eod_data.csv',
                          index_col=0, parse_dates=True).dropna()

In [4]: raw.columns
Out[4]: Index(['AAPL.O', 'MSFT.O', 'INTC.O', 'AMZN.O', 'GS.N', 'SPY', '.SPX',
               '.VIX', 'EUR=', 'XAU=', 'GDX', 'GLD'],
              dtype='object')

In [5]: symbol = 'EUR='

In [6]: data = pd.DataFrame(raw[symbol])

In [7]: data['returns'] = np.log(data / data.shift(1))

In [8]: data.dropna(inplace=True)

In [9]: data['direction'] = np.sign(data['returns']).astype(int)

In [10]: data.head()
Out[10]:               EUR=   returns  direction
         Date
         2010-01-05  1.4368 -0.002988         -1
         2010-01-06  1.4412  0.003058          1
         2010-01-07  1.4318 -0.006544         -1
         2010-01-08  1.4412  0.006544          1
         2010-01-11  1.4513  0.006984          1

In [11]: data['returns'].hist(bins=35, figsize=(10, 6));

Figure 15-6. Histogram of log returns for EUR/USD exchange rate


Second, the code that creates the features data by lagging the log returns and visual- 
izes it in combination with the returns data (see Figure 15-7): 


In [12]: lags = 2

In [13]: def create_lags(data):
             global cols
             cols = []
             for lag in range(1, lags + 1):
                 col = 'lag_{}'.format(lag)
                 data[col] = data['returns'].shift(lag)
                 cols.append(col)

In [14]: create_lags(data)

In [15]: data.head()
Out[15]:               EUR=   returns  direction     lag_1     lag_2
         Date
         2010-01-05  1.4368 -0.002988         -1       NaN       NaN
         2010-01-06  1.4412  0.003058          1 -0.002988       NaN
         2010-01-07  1.4318 -0.006544         -1  0.003058 -0.002988
         2010-01-08  1.4412  0.006544          1 -0.006544  0.003058
         2010-01-11  1.4513  0.006984          1  0.006544 -0.006544

In [16]: data.dropna(inplace=True)

In [17]: data.plot.scatter(x='lag_1', y='lag_2', c='returns',
                           cmap='coolwarm', figsize=(10, 6), colorbar=True)
         plt.axvline(0, c='r', ls='--')
         plt.axhline(0, c='r', ls='--');

Figure 15-7. Scatter plot based on features and labels data


Regression 


With the data set completed, linear OLS regression can be applied to learn about any
potential (linear) relationships, to predict market movement based on the features,
and to backtest a trading strategy based on the predictions. Two basic approaches are
available: using the log returns or only the direction data as the dependent variable
during the regression. In any case, predictions are real-valued and therefore
transformed to either +1 or -1 to only work with the direction of the prediction:


In [18]: from sklearn.linear_model import LinearRegression  # (1)

In [19]: model = LinearRegression()  # (1)

In [20]: data['pos_ols_1'] = model.fit(data[cols],
                                       data['returns']).predict(data[cols])  # (2)

In [21]: data['pos_ols_2'] = model.fit(data[cols],
                                       data['direction']).predict(data[cols])  # (3)

In [22]: data[['pos_ols_1', 'pos_ols_2']].head()
Out[22]:             pos_ols_1  pos_ols_2
         Date
         2010-01-07  -0.000166  -0.000086
         2010-01-08   0.000017   0.040404
         2010-01-11  -0.000244  -0.011756
         2010-01-12  -0.000139  -0.043398
         2010-01-13  -0.000022   0.002237

In [23]: data[['pos_ols_1', 'pos_ols_2']] = np.where(
                     data[['pos_ols_1', 'pos_ols_2']] > 0, 1, -1)  # (4)

In [24]: data['pos_ols_1'].value_counts()  # (5)
Out[24]: -1    1847
          1     288
         Name: pos_ols_1, dtype: int64

In [25]: data['pos_ols_2'].value_counts()  # (5)
Out[25]:  1    1377
         -1     758
         Name: pos_ols_2, dtype: int64

In [26]: (data['pos_ols_1'].diff() != 0).sum()  # (6)
Out[26]: 555

In [27]: (data['pos_ols_2'].diff() != 0).sum()  # (6)
Out[27]: 762

(1) The linear OLS regression implementation from scikit-learn is used.
(2) The regression is implemented on the log returns directly ...
(3) ... and on the direction data which is of primary interest.
(4) The real-valued predictions are transformed to directional values (+1, -1).
(5) The two approaches yield different directional predictions in general.
(6) However, both lead to a relatively large number of trades over time.
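The trade counts in In [26] and In [27] rely on diff() applied to the position
series. A small sketch with made-up positions shows what this expression counts:
the first difference is NaN, and since NaN != 0 evaluates to True, the opening
position is counted as a trade as well. The variable names are illustrative
assumptions:

```python
import pandas as pd

# Made-up position series (+1 long, -1 short), for illustration only.
pos = pd.Series([1, 1, -1, -1, 1, -1, -1])

changes = pos.diff()                  # first entry is NaN, later entries are 0 or +/-2
n_trades = (changes != 0).sum()       # NaN != 0 is True: the opening trade is included
n_flips = (changes.abs() == 2).sum()  # reversals of an existing position only

print(n_trades, n_flips)
```

Which of the two counts is more appropriate depends on whether entering the
first position should be treated as a trade.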


Equipped with the directional prediction, vectorized backtesting can be applied to
judge the performance of the resulting trading strategies. At this stage, the analysis is
based on a number of simplifying assumptions, such as “zero transaction costs” and
the usage of the same data set for both training and testing. Under these assumptions,
however, both regression-based strategies outperform the benchmark passive
investment, while only the strategy trained on the direction of the market shows a
positive overall performance (Figure 15-8):


In [28]: data['strat_ols_1'] = data['pos_ols_1'] * data['returns']

In [29]: data['strat_ols_2'] = data['pos_ols_2'] * data['returns']

In [30]: data[['returns', 'strat_ols_1', 'strat_ols_2']].sum().apply(np.exp)
Out[30]: returns        0.810644
         strat_ols_1    0.942422
         strat_ols_2    1.339286
         dtype: float64

In [31]: (data['direction'] == data['pos_ols_1']).value_counts()  # (1)
Out[31]: False    1093
         True     1042
         dtype: int64

In [32]: (data['direction'] == data['pos_ols_2']).value_counts()  # (1)
Out[32]: True     1096
         False    1039
         dtype: int64

In [33]: data[['returns', 'strat_ols_1', 'strat_ols_2']].cumsum(
                 ).apply(np.exp).plot(figsize=(10, 6));

(1) Shows the number of correct and false predictions by the strategies.

Figure 15-8. Performance of EUR/USD and regression-based strategies over time
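The vectorized backtesting step used throughout this section boils down to
multiplying positions by log returns and exponentiating their sum. The following
is a minimal sketch with made-up numbers (the arrays and variable names are
assumptions for illustration, not data from the chapter):

```python
import numpy as np

# Made-up daily log returns and model positions, for illustration only.
returns = np.array([0.004, -0.006, 0.002, -0.003, 0.005])
position = np.array([1, -1, -1, -1, 1])   # +1 = long, -1 = short

strategy = position * returns             # log return earned by the strategy per day

# Log returns are additive, so gross performance is exp() of their sum.
gross_benchmark = np.exp(returns.sum())
gross_strategy = np.exp(strategy.sum())

# Performance over time, analogous to cumsum().apply(np.exp) in the chapter.
curve_strategy = np.exp(strategy.cumsum())

print(round(gross_benchmark, 4), round(gross_strategy, 4))
```

A gross performance above 1 corresponds to a positive absolute performance; a
value above the benchmark's corresponds to outperformance.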


Clustering 


This section applies k-means clustering, as introduced in “Machine Learning” on 
page 444, to financial time series data to automatically come up with clusters that are 
used to formulate a trading strategy. The idea is that the algorithm identifies two 
clusters of feature values that predict either an upward movement or a downward 
movement. 


The following code applies the k-means algorithm to the two features as used before.
Figure 15-9 visualizes the two clusters:


In [34]: from sklearn.cluster import KMeans

In [35]: model = KMeans(n_clusters=2, random_state=0)  # (1)

In [36]: model.fit(data[cols])
Out[36]: KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
             n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
             random_state=0, tol=0.0001, verbose=0)

In [37]: data['pos_clus'] = model.predict(data[cols])

In [38]: data['pos_clus'] = np.where(data['pos_clus'] == 1, -1, 1)  # (2)

In [39]: data['pos_clus'].values
Out[39]: array([-1,  1, -1, ...,  1,  1, -1])

In [40]: plt.figure(figsize=(10, 6))
         plt.scatter(data[cols].iloc[:, 0], data[cols].iloc[:, 1],
                     c=data['pos_clus'], cmap='coolwarm');

(1) Two clusters are chosen for the algorithm.
(2) Given the cluster values, the position is chosen.

Figure 15-9. Two clusters as identified by the k-means algorithm


Admittedly, this approach is quite arbitrary in this context; after all, how should the
algorithm know what one is looking for? However, the resulting trading strategy
ends up outperforming the benchmark passive investment (see Figure 15-10). It is
noteworthy that no guidance (supervision) is given and that the hit ratio, i.e., the
number of correct predictions in relation to all predictions made, is only marginally
above 50% (1,077 of 2,135):


In [41]: data['strat_clus'] = data['pos_clus'] * data['returns']

In [42]: data[['returns', 'strat_clus']].sum().apply(np.exp)
Out[42]: returns       0.810644
         strat_clus    1.277133
         dtype: float64

In [43]: (data['direction'] == data['pos_clus']).value_counts()
Out[43]: True     1077
         False    1058
         dtype: int64

In [44]: data[['returns', 'strat_clus']].cumsum(
                 ).apply(np.exp).plot(figsize=(10, 6));

Figure 15-10. Performance of EUR/USD and k-means-based strategy over time
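The hit ratio can be read off the value counts in Out[43]. As a short sketch, the
first part below uses the counts shown there, and the second part computes the
same measure from two small made-up arrays (the array values are assumptions
for illustration):

```python
import numpy as np

# Counts taken from Out[43] above.
n_true, n_false = 1077, 1058
hit_ratio = n_true / (n_true + n_false)
print('{:.4f}'.format(hit_ratio))

# The same measure computed directly from labels and positions (made-up values).
direction = np.array([1, -1, 1, 1, -1, -1])
position = np.array([1, 1, 1, -1, -1, -1])
hit_ratio_demo = (direction == position).mean()
print('{:.4f}'.format(hit_ratio_demo))
```

Note that a hit ratio close to 50% does not by itself determine performance,
because the size of the returns on correctly and incorrectly predicted days
matters as well.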


Frequency Approach 


Beyond more sophisticated algorithms and techniques, one might come up with the
idea of just implementing a frequency approach to predict directional movements in
financial markets. To this end, one might transform the two real-valued features to
binary ones and assess the probability of an upward and a downward movement,
respectively, from the historical observations of such movements, given the four
possible combinations for the two binary features ((0, 0), (0, 1), (1, 0), (1, 1)).
Making use of the data analysis capabilities of pandas, such an approach is relatively
easy to implement:


In [45]: def create_bins(data, bins=[0]):
             global cols_bin
             cols_bin = []
             for col in cols:
                 col_bin = col + '_bin'
                 data[col_bin] = np.digitize(data[col], bins=bins)  # (1)
                 cols_bin.append(col_bin)

In [46]: create_bins(data)

In [47]: data[cols_bin + ['direction']].head()  # (2)
Out[47]:             lag_1_bin  lag_2_bin  direction
         Date
         2010-01-07          1          0         -1
         2010-01-08          0          1          1
         2010-01-11          1          0          1
         2010-01-12          1          1         -1
         2010-01-13          0          1          1

In [48]: grouped = data.groupby(cols_bin + ['direction'])
         grouped.size()  # (3)
Out[48]: lag_1_bin  lag_2_bin  direction
         0          0          -1           239
         ...
                                            251
         dtype: int64

In [49]: res = grouped['direction'].size().unstack(fill_value=0)  # (4)

In [50]: def highlight_max(s):
             is_max = s == s.max()
             return ['background-color: yellow' if v else '' for v in is_max]  # (5)

In [51]: res.style.apply(highlight_max, axis=1)  # (5)
Out[51]: <pandas.io.formats.style.Styler at 0x1a194216a0>

(1) Digitizes the feature values given the bins parameter.
(2) Shows the digitized feature values and the label values.
(3) Shows the frequency of the possible movements conditional on the feature value
    combinations.
(4) Transforms the DataFrame object to have the frequencies in columns.
(5) Highlights the highest-frequency value per feature value combination.
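To make the frequency idea concrete, the following sketch repeats the digitize,
group, and unstack steps on a small made-up data set and then picks the most
frequent direction per feature combination. The data values and the names demo,
freq, and prediction are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Made-up lagged returns and directions, for illustration only.
demo = pd.DataFrame({
    'lag_1': [0.01, -0.02, 0.03, -0.01, 0.02, 0.01, -0.03, 0.02],
    'lag_2': [-0.01, 0.01, 0.02, -0.02, 0.01, -0.02, 0.01, 0.03],
    'direction': [1, 1, -1, 1, -1, 1, 1, -1],
})

# With bins=[0], np.digitize maps negative values to 0 and all others to 1.
for col in ['lag_1', 'lag_2']:
    demo[col + '_bin'] = np.digitize(demo[col], bins=[0])

# Count the observed directions for every combination of the binary features.
freq = (demo.groupby(['lag_1_bin', 'lag_2_bin', 'direction'])
            .size().unstack(fill_value=0))

# The most frequent direction per combination is the frequency-based prediction.
prediction = freq.idxmax(axis=1)
print(freq)
print(prediction)
```

In case of a tie, idxmax() returns the first column with the maximum value, so
a real application would need an explicit rule for such combinations.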


Given the frequency data, three feature value combinations hint at an upward
movement, while one (both binary features equal to 1) makes a downward movement
seem more likely. This translates into a trading strategy whose performance is shown
in Figure 15-11:


In [52]: data['pos_freq'] = np.where(data[cols_bin].sum(axis=1) == 2, -1, 1)  # (1)

In [53]: (data['direction'] == data['pos_freq']).value_counts()
Out[53]: True     1102
         False    1033
         dtype: int64

In [54]: data['strat_freq'] = data['pos_freq'] * data['returns']

In [55]: data[['returns', 'strat_freq']].sum().apply(np.exp)
Out[55]: returns       0.810644
         strat_freq    0.989513
         dtype: float64

In [56]: data[['returns', 'strat_freq']].cumsum(
                 ).apply(np.exp).plot(figsize=(10, 6));

(1) Translates the findings given the frequencies to a trading strategy.

Figure 15-11. Performance of EUR/USD and frequency-based trading strategy over
time


Classification 


This section applies the classification algorithms from ML (as introduced in
“Machine Learning” on page 444) to the problem of predicting the direction of price
movements in financial markets. With that background and the examples from
previous sections, the application of the logistic regression, Gaussian Naive Bayes,
and support vector machine approaches is as straightforward as applying them to
smaller sample data sets.


Two Binary Features 


First, a fitting of the models based on the binary feature values and the derivation of 
the resulting position values: 


In [57]: from sklearn import linear_model
         from sklearn.naive_bayes import GaussianNB
         from sklearn.svm import SVC

In [58]: C = 1

In [59]: models = {
             'log_reg': linear_model.LogisticRegression(C=C),
             'gauss_nb': GaussianNB(),
             'svm': SVC(C=C)
         }

In [60]: def fit_models(data):  # (1)
             mfit = {model: models[model].fit(data[cols_bin],
                                              data['direction'])
                     for model in models.keys()}

In [61]: fit_models(data)

In [62]: def derive_positions(data):  # (2)
             for model in models.keys():
                 data['pos_' + model] = models[model].predict(data[cols_bin])

In [63]: derive_positions(data)

(1) A function that fits all models.
(2) A function that derives all position values from the fitted models.
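As a self-contained sketch of the fit-then-predict pattern above, the following
combines both steps in one hypothetical helper, fit_and_predict(), and applies it
to randomly generated binary features and labels. The helper name and the random
data are assumptions for illustration and carry no predictive meaning:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC


def fit_and_predict(models, features, labels):
    """Fit every model and return its in-sample predictions by name.

    Hypothetical helper combining fit_models() and derive_positions().
    """
    return {name: model.fit(features, labels).predict(features)
            for name, model in models.items()}


# Made-up binary features and direction labels, for illustration only.
rng = np.random.default_rng(100)
features = rng.integers(0, 2, size=(200, 2))
labels = np.where(rng.random(200) > 0.5, 1, -1)

models = {
    'log_reg': LogisticRegression(C=1),
    'gauss_nb': GaussianNB(),
    'svm': SVC(C=1),
}
positions = fit_and_predict(models, features, labels)
for name, pos in positions.items():
    print(name, (pos == labels).mean())  # in-sample hit ratio
```

With two binary features there are only four distinct feature combinations, so
each classifier can assign at most one direction per combination. This limits
the number of distinct position series the models can produce.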


Second, the vectorized backtesting of the resulting trading strategies. Figure 15-12 
visualizes the performance over time: 


In [64]: def evaluate_strats(data):  # (1)
             global sel
             sel = []
             for model in models.keys():
                 col = 'strat_' + model
                 data[col] = data['pos_' + model] * data['returns']
                 sel.append(col)
             sel.insert(0, 'returns')

In [65]: evaluate_strats(data)

In [66]: sel.insert(1, 'strat_freq')

In [67]: data[sel].sum().apply(np.exp)  # (2)
Out[67]: returns           0.810644
         strat_freq        0.989513
         strat_log_reg     1.243322
         strat_gauss_nb    1.243322
         strat_svm         0.989513
         dtype: float64

In [68]: data[sel].cumsum().apply(np.exp).plot(figsize=(10, 6));

(1) A function that evaluates all resulting trading strategies.
(2) Some strategies might show the exact same performance.

Figure 15-12. Performance of EUR/USD and classification-based trading strategies
(two binary lags) over time


Five Binary Features 


In an attempt to improve the strategies’ performance, the following code works with
five binary lags instead of two. In particular, the performance of the SVM-based
strategy is significantly improved (see Figure 15-13). On the other hand, the
performance of the LR- and GNB-based strategies is worse:


In [69]: data = pd.DataFrame(raw[symbol])

In [70]: data['returns'] = np.log(data / data.shift(1))

In [71]: data['direction'] = np.sign(data['returns'])

In [72]: lags = 5  # (1)
         create_lags(data)
         data.dropna(inplace=True)

In [73]: create_bins(data)  # (2)
         cols_bin
Out[73]: ['lag_1_bin', 'lag_2_bin', 'lag_3_bin', 'lag_4_bin', 'lag_5_bin']

In [74]: data[cols_bin].head()
Out[74]:             lag_1_bin  lag_2_bin  lag_3_bin  lag_4_bin  lag_5_bin
         Date
         2010-01-12          1          1          0          1          0
         2010-01-13          0          1          1          0          1
         2010-01-14          1          0          1          1          0
         2010-01-15          0          1          0          1          1
         2010-01-19          0          0          1          0          1

In [75]: data.dropna(inplace=True)

In [76]: fit_models(data)

In [77]: derive_positions(data)

In [78]: evaluate_strats(data)

In [79]: data[sel].sum().apply(np.exp)
Out[79]: returns           0.805002
         strat_log_reg     0.971623
         strat_gauss_nb    0.986420
         strat_svm         1.452406
         dtype: float64

In [80]: data[sel].cumsum().apply(np.exp).plot(figsize=(10, 6));

(1) Five lags of the log returns series are now used.
(2) The real-valued features data is transformed to binary data.

Figure 15-13. Performance of EUR/USD and classification-based trading strategies
(five binary lags) over time




Five Digitized Features 


Finally, the following code uses the first and second moment of the historical log
returns to digitize the features data, allowing for more possible feature value
combinations. This improves the performance of all classification algorithms used,
but for SVM the improvement is again most pronounced (see Figure 15-14):


In [81]: mu = data['returns'].mean()  # (1)
         v = data['returns'].std()  # (2)

In [82]: bins = [mu - v, mu, mu + v]  # (3)
         bins  # (3)
Out[82]: [-0.006033537040418665, -0.00010174015279231306, 0.005830056734834039]

In [83]: create_bins(data, bins)

In [84]: data[cols_bin].head()
Out[84]:             lag_1_bin  lag_2_bin  lag_3_bin  lag_4_bin  lag_5_bin
         Date
         2010-01-12          3          3          0          2          1
         2010-01-13          1          3          3          0          2
         2010-01-14          2          1          3          3          0
         2010-01-15          1          2          1          3          3
         2010-01-19          0          1          2          1          3

In [85]: fit_models(data)

In [86]: derive_positions(data)

In [87]: evaluate_strats(data)

In [88]: data[sel].sum().apply(np.exp)
Out[88]: returns           0.805002
         strat_log_reg     1.431120
         strat_gauss_nb    1.815304
         strat_svm         5.653433
         dtype: float64

In [89]: data[sel].cumsum().apply(np.exp).plot(figsize=(10, 6));

(1) The mean log return and ...
(2) ... the standard deviation are used ...
(3) ... to digitize the features data.

Figure 15-14. Performance of EUR/USD and classification-based trading strategies
(five digitized lags) over time
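The digitization with three bin edges can be sketched in isolation as follows.
The returns array is made up for illustration, and ddof=1 is passed explicitly
because NumPy's std() defaults to the population standard deviation, whereas the
pandas std() used above defaults to the sample standard deviation:

```python
import numpy as np

# Made-up log returns, for illustration only.
returns = np.array([-0.012, -0.004, 0.001, 0.003, 0.009, -0.001, 0.015, -0.008])

mu = returns.mean()
v = returns.std(ddof=1)   # sample standard deviation, matching the pandas default
bins = [mu - v, mu, mu + v]

# Three bin edges give four categories:
# 0: below mu - v, 1: [mu - v, mu), 2: [mu, mu + v), 3: at or above mu + v
digitized = np.digitize(returns, bins=bins)
print(bins)
print(digitized)
```

With five lags and four categories each, there are 4 ** 5 = 1,024 possible
feature value combinations instead of the 2 ** 5 = 32 in the binary case.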


Types of Features 


This chapter exclusively works with lagged return data as features
data, mostly in binarized or digitized form. This is mainly done for
convenience, since such features data can be derived from the
financial time series itself. However, in practical applications the
features data can be gained from a wealth of different data sources
and might include other financial time series and statistics derived
thereof, macroeconomic data, company financial indicators, or
news articles. Refer to López de Prado (2018) for an in-depth
discussion of this topic. There are also Python packages for automated
time series feature extraction available, such as tsfresh.


Sequential Train-Test Split 


To better judge the performance of the classification algorithms, the code that follows
implements a sequential train-test split. The idea here is to simulate the situation
where only data up to a certain point in time is available on which to train an ML
algorithm. During live trading, the algorithm is then faced with data it has never seen
before. This is where the algorithm must prove its worth. In this particular case, all
classification algorithms outperform the passive benchmark investment under the
simplified assumptions from before, but only the SVM algorithm achieves a
positive absolute performance (Figure 15-15):


In [90]: split = int(len(data) * 0.5)

In [91]: train = data.iloc[:split].copy()  # (1)

In [92]: fit_models(train)  # (1)

In [93]: test = data.iloc[split:].copy()  # (2)

In [94]: derive_positions(test)  # (2)

In [95]: evaluate_strats(test)  # (2)

In [96]: test[sel].sum().apply(np.exp)
Out[96]: returns           0.850291
         strat_log_reg     0.962989
         strat_gauss_nb    0.941172
         strat_svm         1.048966
         dtype: float64

In [97]: test[sel].cumsum().apply(np.exp).plot(figsize=(10, 6));

(1) Trains all classification algorithms on the training data.
(2) Tests all classification algorithms on the test data.

Figure 15-15. Performance of EUR/USD and classification-based trading strategies
(sequential train-test split)
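The sequential split itself is just a cut at a fixed position of the
chronologically ordered data. A minimal sketch with made-up arrays (X, y, and the
other names are assumptions for illustration):

```python
import numpy as np

# Made-up chronologically ordered observations: features X and labels y.
n = 10
X = np.arange(n * 2).reshape(n, 2)
y = np.where(np.arange(n) % 2 == 0, 1, -1)

split = int(n * 0.5)
# Everything before the split point is used for training ...
X_train, y_train = X[:split], y[:split]
# ... and everything from the split point onward for testing.
X_test, y_test = X[split:], y[split:]

# By construction, no test observation precedes a training observation in time.
print(len(X_train), len(X_test))
```

Because the order is preserved, the model never sees observations from the test
period during training, which is what the live trading situation looks like.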


Randomized Train-Test Split


The classification algorithms are trained and tested on binary or digitized features
data. The idea is that the feature value patterns allow a prediction of future market
movements with a better hit ratio than 50%. Implicitly, it is assumed that the
patterns’ predictive power persists over time. In that sense, it shouldn’t make (too
much of) a difference on which part of the data an algorithm is trained and on which
part of the data it is tested, implying that one can break up the temporal sequence of
the data for training and testing.


A typical way to do this is a randomized train-test split to test the performance of the
classification algorithms out-of-sample, again trying to emulate reality, where an
algorithm during trading is faced with new data on a continuous basis. The approach
used is the same as that applied to the sample data in “Train-test splits: Support
vector machines” on page 459. Based on this approach, none of the algorithms
outperforms the passive benchmark investment out-of-sample, and the SVM-based
strategy shows the weakest performance (see Figure 15-16):


In [98]: from sklearn.model_selection import train_test_split

In [99]: train, test = train_test_split(data, test_size=0.5,
                                        shuffle=True, random_state=100)

In [100]: train = train.copy().sort_index()  # (1)

In [101]: train[cols_bin].head()
Out[101]:             lag_1_bin  lag_2_bin  lag_3_bin  lag_4_bin  lag_5_bin
          Date
          2010-01-12          3          3          0          2          1
          2010-01-13          1          3          3          0          2
          2010-01-14          2          1          3          3          0
          2010-01-15          1          2          1          3          3
          2010-01-20          1          0          1          2          1

In [102]: test = test.copy().sort_index()  # (1)

In [103]: fit_models(train)

In [104]: derive_positions(test)

In [105]: evaluate_strats(test)

In [106]: test[sel].sum().apply(np.exp)
Out[106]: returns           0.878078
          strat_log_reg     0.735893
          strat_gauss_nb    0.765009
          strat_svm         0.695428
          dtype: float64

In [107]: test[sel].cumsum().apply(np.exp).plot(figsize=(10, 6));

(1) Train and test data sets are copied and brought back in temporal order.

Figure 15-16. Performance of EUR/USD and classification-based trading strategies
(randomized train-test split)
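What the randomized split with subsequent sorting does to the time index can be
sketched with NumPy alone. The integer index below stands in for the dates, and
all names are assumptions for illustration; this is not how train_test_split()
is implemented internally:

```python
import numpy as np

n = 10
idx = np.arange(n)                  # stands in for a chronological index

rng = np.random.default_rng(100)    # fixed seed for reproducibility
shuffled = rng.permutation(idx)

split = int(n * 0.5)
# Draw the two subsets at random, then restore temporal order within each,
# mirroring the sort_index() calls above.
train_idx = np.sort(shuffled[:split])
test_idx = np.sort(shuffled[split:])

print(train_idx, test_idx)
```

The two subsets are disjoint but interleaved in time, so training and test
observations can be direct neighbors. With lagged returns as features, this
means that a test observation may share return values with nearby training
observations.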


Deep Neural Networks 


Deep neural networks (DNNs) try to emulate the functioning of the human brain. 
They are in general composed of an input layer (the features), an output layer (the 
labels), and a number of hidden layers. The presence of hidden layers is what makes a 
neural network deep. It allows it to learn more complex relationships and to perform 
better on a number of problem types. When applying DNNs one generally speaks of 
deep learning instead of machine learning. For an introduction to this field, refer to 
Géron (2017) or Gibson and Patterson (2017). 


DNNs with scikit-learn 


This section applies the MLPClassifier algorithm from scikit-learn, as introduced
in “Deep neural networks” on page 454. First, it is trained and tested on the whole
data set, using the digitized features. The algorithm achieves exceptional performance
in-sample (see Figure 15-17), which illustrates the power of DNNs for this type of
problem. It also hints at strong overfitting, since the performance indeed seems
unrealistically good:


In [108]: from sklearn.neural_network import MLPClassifier

In [109]: model = MLPClassifier(solver='lbfgs', alpha=1e-5,
                                hidden_layer_sizes=2 * [250],
                                random_state=1)

In [110]: %time model.fit(data[cols_bin], data['direction'])
          CPU times: user 16.1 s, sys: 156 ms, total: 16.2 s
          Wall time: 9.85 s
Out[110]: MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto',
           beta_1=0.9,
                  beta_2=0.999, early_stopping=False, epsilon=1e-08,
                  hidden_layer_sizes=[250, 250], learning_rate='constant',
                  learning_rate_init=0.001, max_iter=200, momentum=0.9,
                  n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
                  random_state=1, shuffle=True, solver='lbfgs', tol=0.0001,
                  validation_fraction=0.1, verbose=False, warm_start=False)

In [111]: data['pos_dnn_sk'] = model.predict(data[cols_bin])

In [112]: data['strat_dnn_sk'] = data['pos_dnn_sk'] * data['returns']

In [113]: data[['returns', 'strat_dnn_sk']].sum().apply(np.exp)
Out[113]: returns          0.805002
          strat_dnn_sk    35.156677
          dtype: float64

In [114]: data[['returns', 'strat_dnn_sk']].cumsum().apply(
                  np.exp).plot(figsize=(10, 6));

Figure 15-17. Performance of EUR/USD and DNN-based trading strategy (scikit-learn,
in-sample)


To avoid overfitting of the DNN model, a randomized train-test split is applied next.
The algorithm again outperforms the passive benchmark investment and achieves a
positive absolute performance (Figure 15-18). However, the results seem more
realistic now:


In [115]: train, test = train_test_split(data, test_size=0.5,
                                         random_state=100)

In [116]: train = train.copy().sort_index()

In [117]: test = test.copy().sort_index()

In [118]: model = MLPClassifier(solver='lbfgs', alpha=1e-5, max_iter=500,
                                hidden_layer_sizes=3 * [500], random_state=1)  # (1)

In [119]: %time model.fit(train[cols_bin], train['direction'])
          CPU times: user 2min 26s, sys: 1.02 s, total: 2min 27s
          Wall time: 1min 31s
Out[119]: MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto',
           beta_1=0.9,
                  beta_2=0.999, early_stopping=False, epsilon=1e-08,
                  hidden_layer_sizes=[500, 500, 500], learning_rate='constant',
                  learning_rate_init=0.001, max_iter=500, momentum=0.9,
                  n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
                  random_state=1, shuffle=True, solver='lbfgs', tol=0.0001,
                  validation_fraction=0.1, verbose=False, warm_start=False)

In [120]: test['pos_dnn_sk'] = model.predict(test[cols_bin])

In [121]: test['strat_dnn_sk'] = test['pos_dnn_sk'] * test['returns']

In [122]: test[['returns', 'strat_dnn_sk']].sum().apply(np.exp)
Out[122]: returns         0.878078
          strat_dnn_sk    1.242042
          dtype: float64

In [123]: test[['returns', 'strat_dnn_sk']].cumsum(
                  ).apply(np.exp).plot(figsize=(10, 6));

(1) Increases the number of hidden layers and hidden units.

Figure 15-18. Performance of EUR/USD and DNN-based trading strategy (scikit-learn,
randomized train-test split)


DNNs with TensorFlow 


TensorFlow has become a popular package for deep learning. It is developed and
supported by Google Inc. and applied there to a great variety of machine learning
problems. Zadeh and Ramsundar (2018) cover TensorFlow for deep learning in depth.


As with scikit-learn, the application of the DNNClassifier algorithm from
TensorFlow to derive an algorithmic trading strategy is straightforward given the
background from “Deep neural networks” on page 454. The training and test data is the
same as before. First, the training of the model. In-sample, the algorithm outperforms
the passive benchmark investment and shows a considerable absolute return (see
Figure 15-19), again hinting at overfitting:


In [124]: import tensorflow as tf
          tf.logging.set_verbosity(tf.logging.ERROR)

In [125]: fc = [tf.contrib.layers.real_valued_column('lags', dimension=lags)]

In [126]: model = tf.contrib.learn.DNNClassifier(hidden_units=3 * [500],
                                                 n_classes=len(bins) + 1,
                                                 feature_columns=fc)

In [127]: def input_fn():
              fc = {'lags': tf.constant(data[cols_bin].values)}
              la = tf.constant(data['direction'].apply(
                                lambda x: 0 if x < 0 else 1).values,
                                shape=[data['direction'].size, 1])
              return fc, la

In [128]: %time model.fit(input_fn=input_fn, steps=250)  # (1)
          CPU times: user 2min 7s, sys: 8.85 s, total: 2min 16s
          Wall time: 49 s
Out[128]: DNNClassifier(params={'head':
           <tensorflow.contrib.learn.python.learn.estimators.head._MultiClassHead
           object at 0x1a19acf898>, 'hidden_units': [500, 500, 500],
           'feature_columns': (_RealValuedColumn(column_name='lags', dimension=5,
           default_value=None, dtype=tf.float32, normalizer=None),), 'optimizer':
           None, 'activation_fn': <function relu at 0x1161441e0>, 'dropout':
           None, 'gradient_clip_norm': None, 'embedding_lr_multipliers': None,
           'input_layer_min_slice_size': None})

In [129]: model.evaluate(input_fn=input_fn, steps=1)  # (2)
Out[129]: {'loss': 0.6879357, 'accuracy': 0.5379925, 'global_step': 250}

In [130]: pred = np.array(list(model.predict(input_fn=input_fn)))  # (2)
          pred[:10]  # (2)
Out[130]: array([0, 0, 0, 0, 0, 1, 0, 1, 1, 0])

In [131]: data['pos_dnn_tf'] = np.where(pred > 0, 1, -1)  # (3)

In [132]: data['strat_dnn_tf'] = data['pos_dnn_tf'] * data['returns']

In [133]: data[['returns', 'strat_dnn_tf']].sum().apply(np.exp)
Out[133]: returns         0.805002
          strat_dnn_tf    2.437222
          dtype: float64

In [134]: data[['returns', 'strat_dnn_tf']].cumsum(
                  ).apply(np.exp).plot(figsize=(10, 6));

(1) The time needed for training might be considerable.
(2) The binary predictions (0, 1) ...
(3) ... need to be transformed to market positions (-1, +1).

Figure 15-19. Performance of EUR/USD and DNN-based trading strategy (TensorFlow,
in-sample)
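The mapping between market directions and class labels used in input_fn(), and
the mapping of predicted classes back to positions, can be sketched without
TensorFlow. The pred array below is a stand-in for model output and, like the
other values, an assumption for illustration:

```python
import numpy as np

# Made-up market directions as used for the labels in the chapter (-1 or +1).
direction = np.array([-1, 1, 1, -1, 1])

# Class labels for the classifier: 0 for a downward, 1 for an upward movement.
labels = np.where(direction < 0, 0, 1)

# Stand-in for the model output; a real model would return its own predictions.
pred = labels.copy()

# Map predicted classes back to market positions (-1 short, +1 long).
position = np.where(pred > 0, 1, -1)
print(labels, position)
```

Keeping the two mappings symmetric ensures that a perfectly predicted label
translates into a position with the same sign as the market direction.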


The following code again implements a randomized train-test split to get a more
realistic view of the performance of the DNN-based algorithmic trading strategy. The
performance is, as expected, worse out-of-sample (see Figure 15-20). In addition,
given the specific parameterization, the TensorFlow DNNClassifier underperforms
the scikit-learn MLPClassifier algorithm by quite a few percentage points:


In [135]: model = tf.contrib.learn.DNNClassifier(hidden_units=3 * [500],
                                                 n_classes=len(bins) + 1,
                                                 feature_columns=fc)

In [136]: data = train

In [137]: %time model.fit(input_fn=input_fn, steps=2500)
          CPU times: user 11min 7s, sys: 1min 7s, total: 12min 15s
          Wall time: 4min 27s
Out[137]: DNNClassifier(params={'head':
           <tensorflow.contrib.learn.python.learn.estimators.head._MultiClassHead
           object at 0x116828cc00>, 'hidden_units': [500, 500, 500],
           'feature_columns': (_RealValuedColumn(column_name='lags', dimension=5,
           default_value=None, dtype=tf.float32, normalizer=None),), 'optimizer':
           None, 'activation_fn': <function relu at 0x1161441e0>, 'dropout':
           None, 'gradient_clip_norm': None, 'embedding_lr_multipliers': None,
           'input_layer_min_slice_size': None})

In [138]: data = test

In [139]: model.evaluate(input_fn=input_fn, steps=1)
Out[139]: {'loss': 0.82882184, 'accuracy': 0.48968107, 'global_step': 2500}

In [140]: pred = np.array(list(model.predict(input_fn=input_fn)))

In [141]: test['pos_dnn_tf'] = np.where(pred > 0, 1, -1)

In [142]: test['strat_dnn_tf'] = test['pos_dnn_tf'] * test['returns']

In [143]: test[['returns', 'strat_dnn_sk', 'strat_dnn_tf']].sum().apply(np.exp)
Out[143]: returns         0.878078
          strat_dnn_sk    1.242042
          strat_dnn_tf    1.063968
          dtype: float64

In [144]: test[['returns', 'strat_dnn_sk', 'strat_dnn_tf']].cumsum(
                  ).apply(np.exp).plot(figsize=(10, 6));

Figure 15-20. Performance of EUR/USD and DNN-based trading strategy (TensorFlow,
randomized train-test split)


Performance Results


All performance results shown for the different algorithmic trading
strategies from vectorized backtesting so far are illustrative only.
Beyond the simplifying assumption of no transaction costs, the
results depend on a number of other (mostly arbitrarily chosen)
parameters. They also depend on the relatively small end-of-day
price data set used throughout for the EUR/USD exchange rate.
The focus lies on illustrating the application of different
approaches and ML algorithms to financial data, not on deriving a
robust algorithmic trading strategy to be deployed in practice. The
next chapter addresses some of these issues.


Conclusion 


This chapter is about algorithmic trading strategies and judging their performance 
based on vectorized backtesting. It starts with a rather simple algorithmic trading 
strategy based on two simple moving averages, a type of strategy known and used in 
practice for decades. This strategy is used to illustrate vectorized backtesting, making 
heavy use of the vectorization capabilities of NumPy and pandas for data analysis. 


Using OLS regression, the chapter also illustrates the random walk hypothesis on the
basis of a real financial time series. This is the benchmark against which any
algorithmic trading strategy must prove its worth.


The core of the chapter is the application of machine learning algorithms, as
introduced in “Machine Learning” on page 444. A number of algorithms, the majority of
which are of classification type, are applied following mostly the same
“rhythm.” As features, lagged log returns data is used in a number of variants,
although this restriction is by no means necessary; it is mostly made for
convenience and simplicity. In addition, the analysis is based on a number of simplifying
assumptions since the focus is mainly on the technical aspects of applying machine
learning algorithms to financial time series data to predict the direction of financial
market movements.


Further Resources 
• Brock, William, Josef Lakonishok, and Blake LeBaron (1992). “Simple Technical
  Trading Rules and the Stochastic Properties of Stock Returns.” Journal of
  Finance, Vol. 47, No. 5, pp. 1731-1764.
• Fama, Eugene (1965). “Random Walks in Stock Market Prices.” Selected Papers,
  No. 16, Graduate School of Business, University of Chicago.

land: Cambridge University Press.

• Chan, Ernest (2009). Quantitative Trading. Hoboken, NJ: John Wiley & Sons.
• Chan, Ernest (2013). Algorithmic Trading. Hoboken, NJ: John Wiley & Sons.
• Chan, Ernest (2017). Machine Trading. Hoboken, NJ: John Wiley & Sons.
• López de Prado, Marcos (2018). Advances in Financial Machine Learning.
  Hoboken, NJ: John Wiley & Sons.


Technology books covering topics relevant to this chapter include:


• Albon, Chris (2018). Machine Learning with Python Cookbook. Sebastopol, CA:
  O’Reilly.
• Géron, Aurélien (2017). Hands-On Machine Learning with Scikit-Learn and
  TensorFlow. Sebastopol, CA: O’Reilly.
• Gibson, Adam, and Josh Patterson (2017). Deep Learning. Sebastopol, CA:
  O’Reilly.
• VanderPlas, Jake (2016). Python Data Science Handbook. Sebastopol, CA:
  O’Reilly.


For a comprehensive online training program covering Python for algorithmic
trading, see http://certificate.tpq.io.


CHAPTER 16
Automated Trading 


People worry that computers will get too smart and take over the world, but the real 
problem is that they’re too stupid and they’ve already taken over the world. 


—Pedro Domingos 


“Now what?” one might think. A trading platform is available that allows one to
retrieve historical data and streaming data, to place buy and sell orders, and to check
the account status. A number of different methods have been introduced to derive
algorithmic trading strategies by predicting the direction of market price movements.
How can this all be put together to work in automated fashion? This question cannot
be answered in any generality. However, this chapter addresses a number of topics
that are important in this context. The chapter assumes that only a single automated
algorithmic trading strategy is to be deployed. This simplifies, among other things,
aspects like capital and risk management.


The chapter covers the following topics: 


“Capital Management”
As this section demonstrates, depending on the strategy characteristics and the
trading capital available, the Kelly criterion helps with sizing the trades.


“ML-Based Trading Strategy”
To gain confidence in an algorithmic trading strategy, the strategy needs to be
backtested thoroughly both with regard to performance and risk characteristics;
the example strategy used is based on a classification algorithm from machine
learning as introduced in Chapter 15.


“Online Algorithm”
To deploy the algorithmic trading strategy for automated trading, it needs to be
translated into an online algorithm that works with incoming streaming data in
real time.


“Infrastructure and Deployment”
To run automated algorithmic trading strategies robustly and reliably, deployment
in the cloud is the preferred option from an availability, performance, and
security point of view.


“Logging and Monitoring”
To be able to analyze the history and certain events during the deployment of an
automated trading strategy, logging plays an important role; monitoring via
socket communication allows one to observe events (remotely) in real time.


Capital Management 


A central question in algorithmic trading is how much capital to deploy to a given
algorithmic trading strategy given the total available capital. The answer to this
question depends on the main goal one is trying to achieve by algorithmic trading. Most
individuals and financial institutions will agree that the maximization of long-term
wealth is a good candidate objective. This is what Edward Thorp had in mind when
he derived the Kelly criterion for investing, as described in the paper by Rotando and
Thorp (1992).


The Kelly Criterion in a Binomial Setting 


The common way of introducing the theory of the Kelly criterion for investing is on
the basis of a coin tossing game, or more generally a binomial setting (where only two
outcomes are possible). This section follows that route. Assume a gambler is playing a
coin tossing game against an infinitely rich bank or casino. Assume further that the
probability for heads is some value $p$ for which $\frac{1}{2} < p < 1$ holds. The
probability for tails is defined by $q = 1 - p < \frac{1}{2}$. The gambler can place
bets $b > 0$ of arbitrary size, whereby the gambler wins the same amount if right and
loses it all if wrong. Given the assumptions about the probabilities, the gambler would
of course want to bet on heads. Therefore, the expected value for this betting game $B$
(i.e., the random variable representing this game) in a one-shot setting is:


$$E[B] = p \cdot b - q \cdot b = (p - q) \cdot b > 0$$


A risk-neutral gambler with unlimited funds would like to bet as large an amount as
possible since this would maximize the expected payoff. However, trading in financial
markets is not a one-shot game in general. It is a repeated one. Therefore, assume
that $b_i$ represents the amount that is bet on day $i$ and that $c_0$ represents the
initial capital. The capital $c_1$ at the end of day one depends on the betting success
on that day and might be either $c_0 + b_1$ or $c_0 - b_1$. The expected value for a
gamble that is repeated $n$ times then is:


$$E[B^n] = c_0 + \sum_{i=1}^{n} (p - q) \cdot b_i$$


In classical economic theory, with risk-neutral, expected utility-maximizing agents, a
gambler would try to maximize this expression. It is easily seen that it is maximized
by betting all available funds (i.e., $b_i = c_{i-1}$), like in the one-shot scenario.
However, this in turn implies that a single loss will wipe out all available funds and
will lead to ruin (unless unlimited borrowing is possible). Therefore, this strategy
does not lead to a maximization of long-term wealth.
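
The risk of ruin under the bet-everything strategy can be made concrete with a short calculation. The following is a minimal sketch only; the function name `survival_probability` is an illustrative assumption and not part of the chapter's code. When all capital is bet in every round, the gambler survives n rounds only if all n tosses come up heads, which happens with probability p ** n:

```python
def survival_probability(p, n):
    """Probability of n wins in a row, i.e., of not being ruined
    after n rounds when all available capital is bet each round."""
    return p ** n


p = 0.55  # probability for heads, as in the example of this section
for n in (1, 10, 50):
    # the survival probability shrinks geometrically with the number of rounds
    print(n, survival_probability(p, n))
```

Even with a favorable coin, the chance of surviving a long series of all-in bets becomes negligible quickly, which is why the expected-value-maximizing strategy is unsuitable for long-term wealth maximization.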


While betting the maximum capital available might lead to sudden ruin, betting
nothing at all avoids any kind of loss but does not benefit from the advantageous
gamble either. This is where the Kelly criterion comes into play, since it derives the
optimal fraction $f^*$ of the available capital to bet per round of betting. Assume that
$n = h + t$, where $h$ stands for the number of heads observed during $n$ rounds of
betting and where $t$ stands for the number of tails. With these definitions, the
available capital after $n$ rounds is:


$$c_n = c_0 \cdot (1 + f)^h \cdot (1 - f)^t$$


In such a context, long-term wealth maximization boils down to maximizing the 
average geometric growth rate per bet, which is given as: 


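$$
\begin{aligned}
r^g &= \log \left( \frac{c_n}{c_0} \right)^{1/n} \\
    &= \log \left( \frac{c_0 \cdot (1 + f)^h \cdot (1 - f)^t}{c_0} \right)^{1/n} \\
    &= \log \left( (1 + f)^h \cdot (1 - f)^t \right)^{1/n} \\
    &= \frac{h}{n} \log (1 + f) + \frac{t}{n} \log (1 - f)
\end{aligned}
$$


The problem then formally is to maximize the expected average rate of growth by
choosing $f$ optimally. With $E[h] = n \cdot p$ and $E[t] = n \cdot q$, one gets:


$$
\begin{aligned}
E[r^g] &= E\left[ \frac{h}{n} \log (1 + f) + \frac{t}{n} \log (1 - f) \right] \\
       &= E\left[ p \log (1 + f) + q \log (1 - f) \right] \\
       &= p \log (1 + f) + q \log (1 - f) \\
       &\equiv G(f)
\end{aligned}
$$


One can now maximize the term by choosing the optimal fraction $f^*$ according to the
first-order condition. The first derivative is given by: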


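$$
\begin{aligned}
G'(f) &= \frac{p}{1 + f} - \frac{q}{1 - f} \\
      &= \frac{p - p f - q - q f}{(1 + f)(1 - f)} \\
      &= \frac{p - q - f}{(1 + f)(1 - f)}
\end{aligned}
$$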


From the first-order condition, one gets: 


$$G'(f) \overset{!}{=} 0 \Rightarrow f^* = p - q$$


If one trusts this to be the maximum (and not the minimum), this result implies that
it is optimal to invest a fraction $f^* = p - q$ per round of betting. (It is indeed a
maximum, since $G''(f) = -\frac{p}{(1 + f)^2} - \frac{q}{(1 - f)^2} < 0$ for
$0 \le f < 1$.) With, for example, $p = 0.55$ one has $f^* = 0.55 - 0.45 = 0.1$,
indicating that the optimal fraction is 10%.
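
The closed-form result can be checked numerically. The following sketch, assuming only NumPy, evaluates G(f) = p log(1 + f) + q log(1 - f) on a grid of fractions and compares the grid maximizer with p - q. The function name `growth_rate` and the grid resolution are illustrative choices, not part of the chapter's code:

```python
import numpy as np


def growth_rate(f, p):
    """Expected log growth rate per bet: G(f) = p*log(1+f) + q*log(1-f)."""
    q = 1 - p
    return p * np.log(1 + f) + q * np.log(1 - f)


p = 0.55
# grid with step 0.001; f = 1 is excluded since log(1 - f) is undefined there
fractions = np.linspace(0.0, 0.99, 991)
f_numeric = fractions[np.argmax(growth_rate(fractions, p))]
f_closed = p - (1 - p)  # Kelly fraction from the first-order condition
print(f_numeric, f_closed)
```

Both values agree up to the grid resolution, which supports reading the first-order condition as identifying a maximum.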


The following Python code formalizes these concepts and results through simulation. 
First, some imports and configurations: 


In [1]: import math
        import time
        import numpy as np
        import pandas as pd
        import datetime as dt
        import cufflinks as cf
        from pylab import plt

In [2]: np.random.seed(1000)
        plt.style.use('seaborn')
        %matplotlib inline


The idea is to simulate, for example, 50 series with 100 coin tosses per series. The 
Python code for this is straightforward: 


In [3]: p = 0.55  # (1)

In [4]: f = p - (1 - p)  # (2)

In [5]: f  # (2)
Out[5]: 0.10000000000000009

In [6]: I = 50  # (3)

In [7]: n = 100  # (4)


(1) Fixes the probability for heads.

(2) Calculates the optimal fraction according to the Kelly criterion.

(3) The number of series to be simulated.

(4) The number of trials per series.


The major part is the Python function run_simulation(), which achieves the simulation
according to the prior assumptions. Figure 16-1 shows the simulation results:


In [8]: def run_simulation(f):
            c = np.zeros((n, I))  # (1)
            c[0] = 100  # (2)
            for i in range(I):  # (3)
                for t in range(1, n):  # (4)
                    o = np.random.binomial(1, p)  # (5)
                    if o > 0:  # (6)
                        c[t, i] = (1 + f) * c[t - 1, i]  # (7)
                    else:  # (8)
                        c[t, i] = (1 - f) * c[t - 1, i]  # (9)
            return c

In [9]: c_1 = run_simulation(f)  # (10)

In [10]: c_1.round(2)
Out[10]: array([[100.  , 100.  , 100.  , ..., 100.  , 100.  , 100.  ],
                [ 90.  , 110.  ,  90.  , ..., 110.  ,  90.  , 110.  ],
                ...,
                [226.35, 338.13, 413.27, ..., 123.97, 123.97, 123.97],
                [248.99, 371.94, 454.6 , ..., 136.37, 136.37, 136.37],
                [273.89, 409.14, 409.14, ..., 122.73, 150.01, 122.73]])

In [11]: plt.figure(figsize=(10, 6))
         plt.plot(c_1, 'b', lw=0.5)  # (11)
         plt.plot(c_1.mean(axis=1), 'r', lw=2.5);  # (12)


(1) Instantiates an ndarray object to store the simulation results.

(2) Initializes the starting capital with 100.

(3) Outer loop for the series simulations.

(4) Inner loop for the series itself.

(5) Simulates the tossing of a coin.

(6) If 1, i.e., heads ...

(7) ... then add the win to the capital.

(8) If 0, i.e., tails ...

(9) ... then subtract the loss from the capital.

(10) Runs the simulation.

(11) Plots all 50 series.

(12) Plots the average over all 50 series.


Figure 16-1. 50 simulated series with 100 trials each (red line = average)


The following code repeats the simulation for different values of f. As shown in
Figure 16-2, a lower fraction leads to a lower growth rate on average. Higher values
might lead to a higher average capital at the end of the simulation (f = 0.25) or to a
much lower average capital (f = 0.5). In both cases where the fraction f is higher, the
volatility increases considerably:


In [12]: c_2 = run_simulation(0.05)  # (1)

In [13]: c_3 = run_simulation(0.25)  # (2)

In [14]: c_4 = run_simulation(0.5)  # (3)

In [15]: plt.figure(figsize=(10, 6))
         plt.plot(c_1.mean(axis=1), 'r', label='$f^*=0.1$')
         plt.plot(c_2.mean(axis=1), 'b', label='$f=0.05$')
         plt.plot(c_3.mean(axis=1), 'y', label='$f=0.25$')
         plt.plot(c_4.mean(axis=1), 'm', label='$f=0.5$')
         plt.legend(loc=0);


(1) Simulation with f = 0.05.

(2) Simulation with f = 0.25.

(3) Simulation with f = 0.5.




Figure 16-2. Average capital over time for different fractions 
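
The average capital shown in the figure can be dominated by a few lucky paths. As a complement, the following sketch looks at the median terminal capital, which reflects the typical outcome. It is a vectorized variant written for illustration only: the function `simulate_terminal_capital`, the number of paths, and the use of `np.random.default_rng` are assumptions and differ from the loop-based run_simulation() above:

```python
import numpy as np


def simulate_terminal_capital(f, p=0.55, n=100, paths=10_000, c0=100.0, seed=1000):
    """Terminal capital after n bets for many independent paths,
    betting the fixed fraction f of the current capital each round."""
    rng = np.random.default_rng(seed)
    heads = rng.random((paths, n)) < p          # True represents a winning toss
    factors = np.where(heads, 1 + f, 1 - f)     # per-round growth factors
    return c0 * factors.prod(axis=1)


for f in (0.05, 0.1, 0.25, 0.5):
    c_n = simulate_terminal_capital(f)
    # mean versus median terminal capital for each fraction
    print(f, np.mean(c_n), np.median(c_n))
```

Because the same seed is used for every fraction, all runs see the same sequence of wins and losses, so the medians are directly comparable. Under these assumptions, the Kelly fraction of 0.1 yields the highest median among the four values, while the larger fractions lead to lower typical outcomes.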


The Kelly Criterion for Stocks and Indices 


Assume now a stock market setting in which the relevant stock (index) can take on
only two values after a period of one year from today, given its known value today.
The setting is again binomial, but this time a bit closer on the modeling side to stock
market realities.¹ Specifically, assume that:


$$P(r^S = \mu + \sigma) = P(r^S = \mu - \sigma) = \frac{1}{2}$$


with $E[r^S] = \mu > 0$ being the expected return of the stock over one year and
$\sigma > 0$ being the standard deviation of returns (volatility). In a one-period
setting, one gets for the available capital after one year (with $c_0$ and $f$ defined
as before):


$$c(f) = c_0 \cdot \left( 1 + (1 - f) \cdot r + f \cdot r^S \right)$$


1 The exposition follows Hung (2010).


Here, r is the constant short rate earned on cash not invested in the stock. Maximiz- 
ing the geometric growth rate means maximizing the term: 


$$G(f) = \log \frac{c(f)}{c_0}$$


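Assume now that there are $n$ relevant trading days in the year so that for each such
trading day $i$:


$$P\left( r_i^S = \frac{\mu}{n} + \frac{\sigma}{\sqrt{n}} \right) = P\left( r_i^S = \frac{\mu}{n} - \frac{\sigma}{\sqrt{n}} \right) = \frac{1}{2}$$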


Note that volatility scales with the square root of the number of trading days. Under 
these assumptions, the daily values scale up to the yearly ones from before and one 
gets: 


$$c_n(f) = c_0 \cdot \prod_{i=1}^{n} \left( 1 + (1 - f) \cdot \frac{r}{n} + f \cdot r_i^S \right)$$


One now has to maximize the following quantity to achieve maximum long-term 
wealth when investing in the stock: 


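$$
\begin{aligned}
G_n(f) &= E\left[ \log \frac{c_n(f)}{c_0} \right] \\
       &= E\left[ \sum_{i=1}^{n} \log \left( 1 + (1 - f) \cdot \frac{r}{n} + f \cdot r_i^S \right) \right] \\
       &= \frac{1}{2} \sum_{i=1}^{n} \left[ \log \left( 1 + (1 - f) \cdot \frac{r}{n} + f \cdot \left( \frac{\mu}{n} + \frac{\sigma}{\sqrt{n}} \right) \right) + \log \left( 1 + (1 - f) \cdot \frac{r}{n} + f \cdot \left( \frac{\mu}{n} - \frac{\sigma}{\sqrt{n}} \right) \right) \right] \\
       &= \frac{n}{2} \log \left( \left( 1 + (1 - f) \cdot \frac{r}{n} + f \cdot \frac{\mu}{n} \right)^2 - \frac{f^2 \sigma^2}{n} \right)
\end{aligned}
$$


Using a Taylor series expansion, one finally arrives at:


$$G_n(f) = r + (\mu - r) \cdot f - \frac{\sigma^2}{2} \cdot f^2 + \mathcal{O}\left( \frac{1}{\sqrt{n}} \right)$$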


or for infinitely many trading points in time—i.e., for continuous trading—at: 


$$G_\infty(f) = r + (\mu - r) \cdot f - \frac{\sigma^2}{2} \cdot f^2$$


The optimal fraction $f^*$ then is given through the first-order condition by the
expression:


$$f^* = \frac{\mu - r}{\sigma^2}$$


That is, the optimal fraction is the expected excess return of the stock over the
risk-free rate divided by the variance of the returns. This expression looks similar to
the Sharpe ratio (see “Portfolio Optimization”) but is different: the Sharpe ratio
divides the excess return by the volatility, not by the variance.
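
The derivation can be sanity-checked numerically. The sketch below is illustrative only: the function names and the parameter values for mu, r, and sigma are hypothetical and not taken from the chapter. It evaluates the discrete-time growth rate for n trading days, compares it with the continuous-trading limit, and confirms that the closed-form fraction maximizes the limit expression:

```python
import math


def G_n(f, mu, r, sigma, n):
    """Expected log growth for n trading days per year in the binomial model:
    n/2 * log((1 + a)^2 - b^2), with a the drift part and b the up/down move."""
    a = (1 - f) * r / n + f * mu / n
    b = f * sigma / math.sqrt(n)
    # (1 + a)^2 - b^2 = 1 + (2a + a^2 - b^2); log1p keeps precision for large n
    return n / 2 * math.log1p(2 * a + a ** 2 - b ** 2)


def G_inf(f, mu, r, sigma):
    """Growth rate in the continuous-trading limit."""
    return r + (mu - r) * f - sigma ** 2 / 2 * f ** 2


def kelly_fraction(mu, r, sigma):
    """Optimal fraction from the first-order condition of G_inf."""
    return (mu - r) / sigma ** 2


mu, r, sigma = 0.08, 0.02, 0.2  # hypothetical annualized values
f_star = kelly_fraction(mu, r, sigma)
for n in (1, 12, 252, 100_000):
    # the discrete growth rate approaches the limit as n grows
    print(n, G_n(f_star, mu, r, sigma, n), G_inf(f_star, mu, r, sigma))
```

With these inputs, the discrete values move toward the limit value as n increases, and fractions above or below f_star give a lower limit growth rate.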


A real-world example shall illustrate the application of these formulae and their role 
in leveraging equity deployed to trading strategies. The trading strategy under con- 
sideration is simply a passive long position in the S&P 500 index. To this end, base 
data is quickly retrieved and required statistics are easily derived: 


In [16]: raw = pd.read_csv('../../source/tr_eikon_eod_data.csv', 
index_col=0, parse_dates=True) 


In [17]: symbol = '.SPX' 

In [18]: data = pd.DataFrame(raw[symbol]) 

In [19]: data['returns'] = np.log(data / data.shift(1)) 
In [20]: data.dropna(inplace=True) 


In [21]: data.tail()
Out[21]:                .SPX   returns
         Date
         2018-06-25  2717.07 -0.013820
         2018-06-26  2723.06  0.002202
         2018-06-27  2699.63 -0.008642
         2018-06-28  2716.31  0.006160
         2018-06-29  2718.37  0.000758


The statistical properties of the S&P 500 index over the period covered suggest an
optimal fraction of about 4.5 to be invested in the long position in the index. In other
words, for every dollar available 4.5 dollars shall be invested, implying a leverage
ratio of 4.5, in accordance with the optimal Kelly “fraction” (or rather “factor” in this
case). Ceteris paribus, the Kelly criterion implies a higher leverage the higher the
expected return and the lower the volatility (variance):


In [22]: mu = data.returns.mean() * 252  # (1)

In [23]: mu  # (1)
Out[23]: 0.09898579893004976

In [24]: sigma = data.returns.std() * 252 ** 0.5  # (2)

In [25]: sigma  # (2)
Out[25]: 0.1488567510081967

In [26]: r = 0.0  # (3)

In [27]: f = (mu - r) / sigma ** 2  # (4)

In [28]: f  # (4)
Out[28]: 4.4672043679706865


(1) Calculates the annualized return.

(2) Calculates the annualized volatility.

(3) Sets the risk-free rate to 0 (for simplicity).

(4) Calculates the optimal Kelly fraction to be invested in the strategy.


The following code simulates the application of the Kelly criterion and the optimal
leverage ratio. For simplicity and comparison reasons, the initial equity is set to 1
while the initially invested total capital is set to $1 \cdot f^*$. Depending on the
performance of the capital deployed to the strategy, the total capital itself is
adjusted daily according to the available equity. After a loss, the capital is reduced;
after a profit, the capital is increased. The evolution of the equity position compared
to the index itself is shown in Figure 16-3:


In [29]: equs = []

In [30]: def kelly_strategy(f):
             global equs
             equ = 'equity_{:.2f}'.format(f)
             equs.append(equ)
             cap = 'capital_{:.2f}'.format(f)
             data[equ] = 1  # (1)
             data[cap] = data[equ] * f  # (2)
             for i, t in enumerate(data.index[1:]):
                 t_1 = data.index[i]  # (3)
                 data.loc[t, cap] = data[cap].loc[t_1] * \
                                     math.exp(data['returns'].loc[t])  # (4)
                 data.loc[t, equ] = data[cap].loc[t] - \
                                     data[cap].loc[t_1] + \
                                     data[equ].loc[t_1]  # (5)
                 data.loc[t, cap] = data[equ].loc[t] * f  # (6)

In [31]: kelly_strategy(f * 0.5)  # (7)

In [32]: kelly_strategy(f * 0.66)  # (8)

In [33]: kelly_strategy(f)  # (9)

In [34]: print(data[equs].tail())
                     equity_2.23  equity_2.95  equity_4.47
         Date
         2018-06-25     4.707070     6.367340     8.794342
         2018-06-26     4.730248     6.408727     8.880952
         2018-06-27     4.639340     6.246147     8.539593
         2018-06-28     4.703365     6.359932     8.775296
         2018-06-29     4.711332     6.374152     8.805026

In [35]: ax = data['returns'].cumsum().apply(np.exp).plot(legend=True,
                                                          figsize=(10, 6))
         data[equs].plot(ax=ax, legend=True);


(1) Generates a new column for equity and sets the initial value to 1.

(2) Generates a new column for capital and sets the initial value to 1 · f*.

(3) Picks the right DatetimeIndex value for the previous values.

(4) Calculates the new capital position given the return.

(5) Adjusts the equity value according to the capital position performance.

(6) Adjusts the capital position given the new equity position and the fixed leverage
    ratio.

(7) Simulates the Kelly criterion-based strategy for half of f ...

(8) ... for two-thirds of f ...

(9) ... and for f itself.


Figure 16-3. Cumulative performance of S&P 500 compared to equity position given
different values of f


As Figure 16-3 illustrates, applying the optimal Kelly leverage leads to a rather erratic
evolution of the equity position (high volatility), which is intuitively plausible given
the leverage ratio of 4.47. One would expect the volatility of the equity position to
increase with increasing leverage. Therefore, practitioners often reduce the leverage
to, for example, “half Kelly” (i.e., in the current example to $\frac{1}{2} \cdot f^* \approx 2.23$).
For this reason, Figure 16-3 also shows the evolution of the equity position for values
lower than “full Kelly.” The risk indeed reduces with lower values of f.
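
The trade-off behind “half Kelly” can be quantified with the continuous-trading growth formula from the previous section. The following sketch is an illustration under that approximation, not a result from the chapter: the function `excess_growth` is a hypothetical helper, and the inputs are the annualized values reported in Out[23] and Out[25] with r = 0:

```python
def excess_growth(k, mu, r, sigma):
    """Growth rate in excess of r when investing the fraction k * f_star,
    based on the continuous-trading approximation of the growth rate."""
    f = k * (mu - r) / sigma ** 2
    return (mu - r) * f - sigma ** 2 / 2 * f ** 2


mu, r, sigma = 0.09898579893004976, 0.0, 0.1488567510081967
full = excess_growth(1.0, mu, r, sigma)
for k in (0.5, 0.66, 1.0):
    share_of_growth = excess_growth(k, mu, r, sigma) / full
    position_vol = k * (mu - r) / sigma  # volatility of the leveraged position
    print(k, share_of_growth, position_vol)
```

Under this approximation, halving the leverage halves the volatility of the position but keeps three quarters of the excess growth rate, which is one way to rationalize the practitioners' preference for fractional Kelly.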


ML-Based Trading Strategy 


Chapter 14 introduces the FXCM trading platform, its REST API, and the Python
wrapper package fxcmpy. This section combines an ML-based approach for predicting
the direction of market price movements with historical data from the FXCM
REST API to backtest an algorithmic trading strategy for the EUR/USD currency pair.
It uses vectorized backtesting, this time taking into account the bid-ask spread as
proportional transaction costs. Compared to the plain vectorized backtesting
approach introduced in Chapter 15, it also adds a more in-depth analysis of the risk
characteristics of the trading strategy tested.

Vectorized Backtesting


The backtest is based on intraday data, more specifically on bars of length five
minutes. The following code connects to the FXCM REST API and retrieves
five-minute bar data for a whole month. Figure 16-4 visualizes the mid close prices over
the period for which data is retrieved:


In [36]: import fxcmpy

In [37]: fxcmpy.__version__
Out[37]: '1.1.33'

In [38]: api = fxcmpy.fxcmpy(config_file='../fxcm.cfg')  # (1)

In [39]: data = api.get_candles('EUR/USD', period='m5',
                                start='2018-06-01 00:00:00',
                                stop='2018-06-30 00:00:00')  # (1)

In [40]: data.iloc[-5:, 4:]
Out[40]:                      askopen  askclose  askhigh   asklow  tickqty
         date
         2018-06-29 20:35:00  1.16862   1.16882  1.16896  1.16839      601
         2018-06-29 20:40:00  1.16882   1.16853  1.16898  1.16852      387
         2018-06-29 20:45:00  1.16853   1.16826  1.16862  1.16822      592
         2018-06-29 20:50:00  1.16826   1.16836  1.16846  1.16819      842
         2018-06-29 20:55:00  1.16836   1.16861  1.16876  1.16834      540

In [41]: data.info()
         <class 'pandas.core.frame.DataFrame'>
         DatetimeIndex: 6083 entries, 2018-06-01 00:00:00 to 2018-06-29 20:55:00
         Data columns (total 9 columns):
         bidopen     6083 non-null float64
         bidclose    6083 non-null float64
         bidhigh     6083 non-null float64
         bidlow      6083 non-null float64
         askopen     6083 non-null float64
         askclose    6083 non-null float64
         askhigh     6083 non-null float64
         asklow      6083 non-null float64
         tickqty     6083 non-null int64
         dtypes: float64(8), int64(1)
         memory usage: 475.2 KB

In [42]: spread = (data['askclose'] - data['bidclose']).mean()  # (2)
         spread  # (2)
Out[42]: 2.6338977478217845e-05

In [43]: data['midclose'] = (data['askclose'] + data['bidclose']) / 2  # (3)

In [44]: ptc = spread / data['midclose'].mean()  # (4)
         ptc  # (4)
Out[44]: 2.255685318140426e-05

In [45]: data['midclose'].plot(figsize=(10, 6), legend=True);


(1) Connects to the API and retrieves the data.

(2) Calculates the average bid-ask spread.

(3) Calculates the mid close prices from the ask and bid close prices.

(4) Calculates the average proportional transaction costs given the average spread
    and the average mid close price.


Figure 16-4. EUR/USD exchange rate (five-minute bars)


The ML-based strategy is based on lagged return data that is binarized. That is, the 
ML algorithm learns from historical patterns of upward and downward movements 
whether another upward or downward movement is more likely. Accordingly, the 
following code creates features data with values of 0 and 1 as well as labels data with 
values of +1 and -1 indicating the observed market direction in all cases: 


In [46]: data['returns'] = np.log(data['midclose'] / data['midclose'].shift(1))

In [47]: data.dropna(inplace=True)

In [48]: lags = 5

In [49]: cols = []
         for lag in range(1, lags + 1):
             col = 'lag_{}'.format(lag)
             data[col] = data['returns'].shift(lag)  # (1)
             cols.append(col)

In [50]: data.dropna(inplace=True)

In [51]: data[cols] = np.where(data[cols] > 0, 1, 0)  # (2)

In [52]: data['direction'] = np.where(data['returns'] > 0, 1, -1)  # (3)

In [53]: data[cols + ['direction']].head()
Out[53]:                      lag_1  lag_2  lag_3  lag_4  lag_5  direction
         date
         2018-06-01 00:30:00
         2018-06-01 00:35:00
         2018-06-01 00:40:00
         2018-06-01 00:45:00
         2018-06-01 00:50:00
         [cell values (0/1 for the lag columns, +1/-1 for direction) not legible
         in the source]


(1) Creates the lagged return data given the number of lags.

(2) Transforms the feature values to binary data.

(3) Transforms the returns data to directional label data.


Given the features and label data, different supervised learning algorithms can now
be applied. In what follows, a support vector machine algorithm for classification
from the scikit-learn ML package is used. The code trains and tests the algorithmic
trading strategy based on a sequential train-test split. The accuracy scores of the
model for the training and test data are slightly above 50%, and the score is even a
bit higher on the test data. In a financial trading context, one would speak of the
hit ratio of the trading strategy instead of the accuracy score, i.e., the number of
winning trades compared to all trades. Since the hit ratio is greater than 50%, this
might indicate, in the context of the Kelly criterion, a slight edge compared to a
random walk setting:


In [54]: from sklearn.svm import SVC
         from sklearn.metrics import accuracy_score

In [55]: model = SVC(C=1, kernel='linear', gamma='auto')

In [56]: split = int(len(data) * 0.80)

In [57]: train = data.iloc[:split].copy()

In [58]: model.fit(train[cols], train['direction'])
Out[58]: SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
           decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
           max_iter=-1, probability=False, random_state=None, shrinking=True,
           tol=0.001, verbose=False)

In [59]: accuracy_score(train['direction'], model.predict(train[cols]))  # (1)
Out[59]: 0.5198518823287389

In [60]: test = data.iloc[split:].copy()

In [61]: test['position'] = model.predict(test[cols])

In [62]: accuracy_score(test['direction'], test['position'])  # (2)
Out[62]: 0.5419407894736842


(1) The accuracy of the predictions from the trained model in-sample (training
    data).

(2) The accuracy of the predictions from the trained model out-of-sample (test data).


It is well known that the hit ratio is only one aspect of success in financial trading.
Also crucial are, among other things, the transaction costs implied by the trading
strategy and getting the important trades right.² To this end, only a formal vectorized
backtesting approach allows judgment of the quality of the trading strategy. The
following code takes into account the proportional transaction costs based on the
average bid-ask spread. Figure 16-5 compares the performance of the algorithmic trading
strategy (without and with proportional transaction costs) to the performance of the
passive benchmark investment:


In [63]: test['strategy'] = test['position'] * test['returns']  # (1)

In [64]: sum(test['position'].diff() != 0)  # (2)
Out[64]: 660

In [65]: test['strategy_tc'] = np.where(test['position'].diff() != 0,
                                        test['strategy'] - ptc,  # (3)
                                        test['strategy'])

In [66]: test[['returns', 'strategy', 'strategy_tc']].sum(
                 ).apply(np.exp)
Out[66]: returns        0.999324
         strategy       1.026141
         strategy_tc    1.010977
         dtype: float64

In [67]: test[['returns', 'strategy', 'strategy_tc']].cumsum(
                 ).apply(np.exp).plot(figsize=(10, 6));


(1) Derives the log returns for the ML-based algorithmic trading strategy.

(2) Calculates the number of trades implied by the trading strategy based on changes
    in the position.

(3) Whenever a trade takes place, the proportional transaction costs are subtracted
    from the strategy's log return for that bar.


Figure 16-5. Performance of EUR/USD exchange rate and algorithmic trading strategy


2 It is a stylized empirical fact that it is of paramount importance for investment and
trading performance to get the largest market movements right, i.e., the biggest upward
and downward movements. This aspect is neatly illustrated in Figures 16-5 and 16-7,
which show that the trading strategy gets a large upward movement in the underlying
instrument wrong, leading to a large dip for the trading strategy.


Limitations of Vectorized Backtesting 


Vectorized backtesting has its limits with regard to how closely
strategies can be tested against market realities. For example, it
does not allow direct inclusion of fixed transaction costs per trade.
One could, as an approximation, take a multiple of the average
proportional transaction costs (based on average position sizes) to
account indirectly for fixed transaction costs. However, this would
not be precise in general. If a higher degree of precision is required,
other approaches, such as event-based backtesting with explicit
loops over every bar of the price data, need to be applied.
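
To illustrate what an event-based approach with an explicit loop could look like, consider the following minimal sketch. It is not the chapter's implementation: the function `event_based_backtest`, its parameters, and the cost model (fixed plus proportional costs charged on every position change, full equity invested) are assumptions made for illustration:

```python
def event_based_backtest(prices, positions, equity=10_000.0, ptc=0.0, ftc=0.0):
    """Explicit loop over bars.

    positions[t] (+1 long, -1 short, 0 flat) is held from bar t to bar t + 1.
    ptc is a proportional cost rate, ftc a fixed cost in currency units;
    both are charged whenever the position changes (hypothetical cost model).
    """
    previous = 0
    for t in range(len(prices) - 1):
        position = positions[t]
        if position != previous:
            equity -= ftc + ptc * equity    # costs of the trade
            previous = position
        # simple return of the bar, applied in the direction of the position
        equity *= 1 + position * (prices[t + 1] / prices[t] - 1)
    return equity


prices = [100.0, 110.0, 99.0]   # toy price series
positions = [1, -1]             # long over the first bar, short over the second
print(event_based_backtest(prices, positions, equity=100.0))
print(event_based_backtest(prices, positions, equity=100.0, ftc=1.0))
```

The loop makes it straightforward to charge a fixed amount per trade, which is the feature that the vectorized approach cannot include directly. The price is a slower backtest, since every bar is processed individually.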
Optimal Leverage


Equipped with the trading strategy’s log returns data, the mean and variance values 
can be calculated in order to derive the optimal leverage according to the Kelly crite- 
rion. The code that follows scales the numbers to annualized values, although this 
does not change the optimal leverage values according to the Kelly criterion since the 
mean return and the variance scale with the same factor: 


In [68]: mean = test[['returns', 'strategy_tc']].mean() * len(data) * 12  # (1)
         mean
Out[68]: returns       -0.040535
         strategy_tc    0.654711
         dtype: float64

In [69]: var = test[['returns', 'strategy_tc']].var() * len(data) * 12  # (2)
         var
Out[69]: returns        0.007861
         strategy_tc    0.007837
         dtype: float64

In [70]: vol = var ** 0.5  # (3)
         vol
Out[70]: returns        0.088663
         strategy_tc    0.088524
         dtype: float64

In [71]: mean / var  # (4)
Out[71]: returns        -5.156448
         strategy_tc    83.545792
         dtype: float64

In [72]: mean / var * 0.5  # (5)
Out[72]: returns        -2.578224
         strategy_tc    41.772896
         dtype: float64


(1) Annualized mean returns.

(2) Annualized variances.

(3) Annualized volatilities.

(4) Optimal leverage according to the Kelly criterion (“full Kelly”).

(5) Optimal leverage according to the Kelly criterion (“half Kelly”).
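
That the annualization factor does not affect the Kelly leverage can be verified directly. The sketch below uses synthetic per-bar returns; the helper `kelly_leverage`, the random data, and the factor of 252 are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(100)
returns = rng.normal(0.0001, 0.001, 5_000)  # synthetic per-bar log returns


def kelly_leverage(returns, periods_per_year=1):
    """Mean divided by variance, both scaled by the same factor."""
    mean = returns.mean() * periods_per_year
    var = returns.var(ddof=1) * periods_per_year  # sample variance (ddof=1)
    return mean / var


# the scaling factor cancels out in the ratio
print(kelly_leverage(returns), kelly_leverage(returns, 252))
```

Both calls return the same value up to floating-point rounding, since the factor appears in the numerator and the denominator alike. Note that this would not hold for the ratio of mean to volatility, because volatility scales with the square root of the factor.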


Using the “half Kelly” criterion, the optimal leverage for the trading strategy is about
40. With a number of brokers, such as FXCM, and financial instruments, such as
foreign exchange and contracts for difference (CFDs), such leverage ratios are feasible,
even for retail traders.³ Figure 16-6 shows in comparison the performance of the
trading strategy with transaction costs for different leverage values:


In [73]: to_plot = ['returns', 'strategy_tc']

In [74]: for lev in [10, 20, 30, 40, 50]:
             label = 'lstrategy_tc_%d' % lev
             test[label] = test['strategy_tc'] * lev  # (1)
             to_plot.append(label)

In [75]: test[to_plot].cumsum().apply(np.exp).plot(figsize=(10, 6));


(1) Scales the strategy returns for different leverage values.


Figure 16-6. Performance of algorithmic trading strategy for different leverage values


Risk Analysis 


Since leverage increases the risk associated with a trading strategy, a more in-depth
risk analysis seems in order. The risk analysis that follows assumes a leverage ratio of
30. First, the maximum drawdown and the longest drawdown period are calculated.
Maximum drawdown is the largest loss (dip) after a recent high. Accordingly, the
longest drawdown period is the longest period that the trading strategy needs to get
back to a recent high. The analysis assumes that the initial equity position is 3,333
EUR, leading to an initial position size of 100,000 EUR for a leverage ratio of 30. It
also assumes that there are no adjustments with regard to the equity over time, no
matter what the performance is:


3 Leverage increases risks associated with trading strategies significantly. Traders
should read the risk disclaimers and regulations carefully. A positive backtesting
performance is also no guarantee whatsoever of future performance. All results shown
are illustrative only and are meant to demonstrate the application of programming and
analytics approaches. In some jurisdictions, such as in Germany, leverage ratios are
capped for retail traders based on different groups of financial instruments.


In [76]: equity = 3333  # (1)

In [77]: risk = pd.DataFrame(test['lstrategy_tc_30'])  # (2)

In [78]: risk['equity'] = risk['lstrategy_tc_30'].cumsum(
                 ).apply(np.exp) * equity  # (3)

In [79]: risk['cummax'] = risk['equity'].cummax()  # (4)

In [80]: risk['drawdown'] = risk['cummax'] - risk['equity']  # (5)

In [81]: risk['drawdown'].max()  # (6)
Out[81]: 781.7073602069818

In [82]: t_max = risk['drawdown'].idxmax()  # (7)
         t_max  # (7)
Out[82]: Timestamp('2018-06-29 02:45:00')


(1) The initial equity.

(2) The relevant log returns time series ...

(3) ... scaled by the initial equity.

(4) The cumulative maximum values over time.

(5) The drawdown values over time.

(6) The maximum drawdown value.

(7) The point in time when it happens.
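
The drawdown logic can be reproduced on a small toy series to see how the cumulative maximum drives the result. The following is a NumPy-based sketch for illustration; the function `max_drawdown` and the toy equity values are assumptions and not part of the chapter's code:

```python
import numpy as np


def max_drawdown(equity):
    """Returns the maximum drawdown and the position at which it occurs."""
    equity = np.asarray(equity, dtype=float)
    cummax = np.maximum.accumulate(equity)  # running high of the equity curve
    drawdown = cummax - equity              # distance to the running high
    t_max = int(np.argmax(drawdown))
    return drawdown[t_max], t_max


equity = [100, 120, 90, 110, 130, 125]      # toy equity curve
print(max_drawdown(equity))
```

For the toy series, the running high is 120 when the equity falls to 90, so the largest dip relative to a previous high occurs at the third observation.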
Technically, a (new) high is characterized by a drawdown value of 0. The drawdown
period is the time between two such highs. Figure 16-7 visualizes both the maximum
drawdown and the drawdown periods:


In [83]: temp = risk['drawdown'][risk['drawdown'] == 0]  # (1)

In [84]: periods = (temp.index[1:].to_pydatetime() -
                    temp.index[:-1].to_pydatetime())  # (2)

In [85]: periods[20:30]  # (2)
Out[85]: array([datetime.timedelta(seconds=68700),
                datetime.timedelta(seconds=72000),
                datetime.timedelta(seconds=1800), datetime.timedelta(seconds=300),
                datetime.timedelta(seconds=600), datetime.timedelta(seconds=300),
                datetime.timedelta(seconds=17400),
                datetime.timedelta(seconds=4500), datetime.timedelta(seconds=1500),
                datetime.timedelta(seconds=900)], dtype=object)

In [86]: t_per = periods.max()  # (3)

In [87]: t_per  # (3)
Out[87]: datetime.timedelta(seconds=76500)

In [88]: t_per.seconds / 60 / 60  # (4)
Out[88]: 21.25

In [89]: risk[['equity', 'cummax']].plot(figsize=(10, 6))
         plt.axvline(t_max, c='r', alpha=0.5);


(1) Identifies highs for which the drawdown must be 0.

(2) Calculates the timedelta values between all highs.

(3) The longest drawdown period in seconds ...

(4) ... and hours.


Figure 16-7. Maximum drawdown (vertical line) and drawdown periods (horizontal
lines)
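
The same idea, the time between two consecutive highs, can be sketched with plain NumPy on bar positions instead of timestamps. The helper `longest_drawdown_period`, the toy data, and the assumed bar length of five minutes are illustrative only:

```python
import numpy as np


def longest_drawdown_period(equity, bar_minutes=5):
    """Longest time (in minutes) between two consecutive highs."""
    equity = np.asarray(equity, dtype=float)
    drawdown = np.maximum.accumulate(equity) - equity
    highs = np.flatnonzero(drawdown == 0)   # bars at which a (new) high is reached
    if len(highs) < 2:
        return 0
    return int(np.diff(highs).max()) * bar_minutes


equity = [100, 120, 90, 110, 130, 125]      # toy equity curve
print(longest_drawdown_period(equity))
```

As with the approach in the text, a drawdown that has not recovered by the end of the sample is not counted, since no closing high exists for it. In a live setting one might want to track such open drawdowns separately.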


Another important risk measure is value-at-risk (VaR). It is quoted as a currency 
amount and represents the maximum loss to be expected given both a certain time 
horizon and a confidence level. The code that follows derives VaR values based on 
the log returns of the equity position for the leveraged trading strategy over time for 
different confidence levels. The time interval is fixed to the bar length of five minutes: 


In [91]: import scipy.stats as scs

In [92]: percs = np.array([0.01, 0.1, 1., 2.5, 5.0, 10.0])  # (1)

In [93]: risk['returns'] = np.log(risk['equity'] /
                                  risk['equity'].shift(1))

In [94]: VaR = scs.scoreatpercentile(equity * risk['returns'], percs)  # (2)

In [95]: def print_var():
             print('%16s %16s' % ('Confidence Level', 'Value-at-Risk'))
             print(33 * '-')
             for pair in zip(percs, VaR):
                 print('%16.2f %16.3f' % (100 - pair[0], -pair[1]))  # (3)

In [96]: print_var()  # (3)
         Confidence Level    Value-at-Risk
                    99.90          175.932
                    99.00           88.139
                    97.50           60.485
                    95.00           45.010
                    90.00           32.056


(1) Defines the percentile values to be used.

(2) Calculates the VaR values given the percentile values.

(3) Translates the percentile values into confidence levels and the VaR values
    (negative values) to positive values for printing.
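
The percentile-based VaR calculation can be illustrated on a toy profit-and-loss sample for which the result is easy to verify by hand. In this sketch, `np.percentile` is used in place of `scs.scoreatpercentile` to avoid the SciPy dependency; the function `historical_var` and the toy data are assumptions for illustration:

```python
import numpy as np


def historical_var(pnl, confidence):
    """VaR as the positive loss at the (100 - confidence) percentile
    of the observed profit-and-loss values."""
    return -np.percentile(pnl, 100 - confidence)


pnl = np.arange(-50.0, 51.0)  # toy P&L values: -50, -49, ..., 50
for confidence in (99.0, 95.0, 90.0):
    print(confidence, historical_var(pnl, confidence))
```

With 101 equally spaced values, the 5th percentile is the sixth-smallest observation, so the 95% VaR of the toy sample equals 45 currency units. Higher confidence levels move further into the left tail and yield larger VaR values.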


Finally, the following code calculates the VaR values for a time horizon of one hour 
by resampling the original DataFrame object. In effect, the VaR values are increased 
for all confidence levels but the highest one: 


In [97]: hourly = risk.resample('1H', label='right').last()  # (1)

In [98]: hourly['returns'] = np.log(hourly['equity'] /
                                    hourly['equity'].shift(1))

In [99]: VaR = scs.scoreatpercentile(equity * hourly['returns'], percs)  # (2)

In [100]: print_var()
          Confidence Level    Value-at-Risk
                     99.99          389.524
                     99.90          372.657
                     99.00          205.662
                     97.50          186.999
                     95.00          164.869
                     90.00          101.835


(1) Resamples the data from five-minute to one-hour bars.
(2) Recalculates the VaR values for the resampled data.
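

The effect of the longer time horizon can be illustrated with synthetic data. The following is a minimal sketch, assuming a randomly generated five-minute equity curve; the helper `var_from_equity` and all parameter values are hypothetical, and the resampling rule is passed as a `pd.Timedelta` object instead of a frequency string:

```python
import numpy as np
import pandas as pd

# synthetic five-minute equity curve (assumption, for illustration only)
rng = np.random.default_rng(100)
index = pd.date_range('2018-06-01', periods=12 * 24 * 20,
                      freq=pd.Timedelta(minutes=5))
rets = rng.normal(0.0, 0.0005, len(index))
risk = pd.DataFrame({'equity': 3333 * np.exp(rets.cumsum())}, index=index)


def var_from_equity(data, percs):
    ''' Historical VaR based on the log returns of an equity curve,
    scaled by the initial equity value.
    '''
    returns = np.log(data['equity'] / data['equity'].shift(1)).dropna()
    return -np.percentile(data['equity'].iloc[0] * returns, percs)


percs = np.array([1.0, 5.0, 10.0])
hourly = risk.resample(pd.Timedelta(hours=1), label='right').last()

var_5min = var_from_equity(risk, percs)
var_1h = var_from_equity(hourly, percs)
print(var_5min)
print(var_1h)
```

With independent returns, as assumed here, the hourly values should exceed the five-minute values at every confidence level, because twelve five-minute returns accumulate in each hourly bar.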


Persisting the Model Object 


Once the algorithmic trading strategy is “accepted” based on the backtesting, leveraging,
and risk analysis results, the model object might be persisted for later use in
deployment. It now embodies the ML-based trading strategy or the trading algorithm:


In [101]: import pickle 


In [102]: pickle.dump(model, open('algorithm.pkl', 'wb')) 
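

The same round trip can be written with context managers so that the file is closed even if an error occurs. The following is a minimal sketch, assuming a plain dictionary as a stand-in for the fitted model object and a temporary directory as the storage location (both are illustrative choices):

```python
import os
import pickle
import tempfile

# stand-in for the fitted model object (assumption)
model = {'kernel': 'linear', 'C': 1, 'lags': 5}

# temporary location instead of the working directory (assumption)
path = os.path.join(tempfile.mkdtemp(), 'algorithm.pkl')

with open(path, 'wb') as f:  # the file is closed even if dump() fails
    pickle.dump(model, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == model)
```

Note that unpickling can execute arbitrary code, so only files from a trusted source should be loaded this way.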


Online Algorithm


The trading algorithm tested so far is an offline algorithm. Such algorithms use a 
complete data set to solve a problem at hand. The problem has been to train an SVM 
algorithm based on binarized features data and directional label data. In practice, 
when deploying the trading algorithm in financial markets, it must consume data 
piece-by-piece as it arrives to predict the direction of the market movement for the 
next time interval (bar). This section makes use of the persisted model object from 
the previous section and embeds it into a streaming data environment. 


The code that transforms the offline trading algorithm into an online trading algo- 
rithm mainly addresses the following issues: 


Tick data 
Tick data arrives in real time and is to be processed in real time 


Resampling 
The tick data is to be resampled to the appropriate bar size given the trading 
algorithm 


Prediction 
The trading algorithm generates a prediction for the direction of the market 
movement over the relevant time interval that by nature lies in the future 


Orders 
Given the current position and the prediction (“signal”) generated by the algo- 
rithm, an order is placed or the position is kept 


“Retrieving Streaming Data” on page 477 shows how to retrieve tick data from the 
FXCM REST API in real time. The basic approach is to subscribe to a market data 
stream and pass a callback function that processes the data. 
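

The resampling and feature preparation steps listed above can be tried out offline before connecting to any API. The following is a minimal sketch, assuming synthetic tick data with `Bid` and `Ask` columns and five lags; the data and all parameter values are made up for illustration:

```python
import numpy as np
import pandas as pd

lags = 5
bar = pd.Timedelta(seconds=15)  # bar length for resampling

# synthetic tick data with irregular time stamps (assumption)
rng = np.random.default_rng(100)
n = 600
times = pd.Timestamp('2018-07-25 07:00:00') + pd.to_timedelta(
    np.sort(rng.uniform(0, 300, n)), unit='s')
mid = 1.1689 + rng.normal(0, 0.00002, n).cumsum()
ticks = pd.DataFrame({'Bid': mid - 0.00001, 'Ask': mid + 0.00001},
                     index=times)

# resamples the ticks to bars and derives the directional feature
bars = ticks.resample(bar, label='right').last().ffill()
bars['Mid'] = bars[['Bid', 'Ask']].mean(axis=1)
bars['Returns'] = np.log(bars['Mid'] / bars['Mid'].shift(1))
bars['Direction'] = np.where(bars['Returns'] > 0, 1, -1)

# picks the lagged directions, excluding the last (possibly incomplete) bar
features = bars['Direction'].iloc[-(lags + 1):-1].values.reshape(1, -1)
print(features)
```

The resulting array has one row and `lags` columns, which is the shape a fitted scikit-learn classifier expects for a single prediction.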


First, the persisted trading algorithm is loaded—it represents the trading logic to be 
followed. It might also be useful to define a helper function to print out the open 
position(s) while the trading algorithm is trading: 


In [103]: algorithm = pickle.load(open('algorithm.pkl', 'rb'))

In [104]: algorithm
Out[104]: SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
            decision_function_shape='ovr', degree=3, gamma='auto',
            kernel='linear', max_iter=-1, probability=False,
            random_state=None, shrinking=True, tol=0.001, verbose=False)

In [105]: sel = ['tradeId', 'amountK', 'currency',
                 'grossPL', 'isBuy']  # (1)

In [106]: def print_positions(pos):
              print('\n\n' + 50 * '=')
              print('Going {}.\n'.format(pos))
              time.sleep(1.5)  # (2)
              print(api.get_open_positions()[sel])  # (3)
              print(50 * '=' + '\n\n')


(1) Defines the DataFrame columns to be shown.
(2) Waits a bit for the order to be executed and reflected in the open positions.
(3) Prints the open positions.


Before the online algorithm is defined and started, a few parameter values are set: 


In [107]: symbol = 'EUR/USD'  # (1)
          bar = '15s'  # (2)
          amount = 100  # (3)
          position = 0  # (4)
          min_bars = lags + 1  # (5)
          df = pd.DataFrame()  # (6)


(1) Instrument symbol to be traded.
(2) Bar length for resampling; for easier testing, the bar length might be shortened
    compared to the real deployment length (e.g., 15 seconds instead of 5 minutes).
(3) The amount, in thousands, to be traded.
(4) The initial position (“neutral”).
(5) The minimum number of resampled bars required for the first prediction and
    trade to be possible.
(6) An empty DataFrame object to be used later for the resampled data.


Following is the callback function automated_strategy() that transforms the trad- 
ing algorithm into a real-time context: 


In [108]: def automated_strategy(data, dataframe):
              global min_bars, position, df
              ldf = len(dataframe)  # (1)
              df = dataframe.resample(bar, label='right').last().ffill()  # (2)
              if ldf % 20 == 0:
                  print('%3d' % len(dataframe), end=',')

              if len(df) > min_bars:
                  min_bars = len(df)
                  df['Mid'] = df[['Bid', 'Ask']].mean(axis=1)
                  df['Returns'] = np.log(df['Mid'] / df['Mid'].shift(1))
                  df['Direction'] = np.where(df['Returns'] > 0, 1, -1)
                  features = df['Direction'].iloc[-(lags + 1):-1]  # (3)
                  features = features.values.reshape(1, -1)  # (4)
                  signal = algorithm.predict(features)[0]  # (5)

                  if position in [0, -1] and signal == 1:  # (6)
                      api.create_market_buy_order(
                          symbol, amount - position * amount)
                      position = 1
                      print_positions('LONG')
                  elif position in [0, 1] and signal == -1:  # (7)
                      api.create_market_sell_order(
                          symbol, amount + position * amount)
                      position = -1
                      print_positions('SHORT')
              if len(dataframe) > 350:  # (8)
                  api.unsubscribe_market_data('EUR/USD')
                  api.close_all()


(1) Captures the length of the DataFrame object with the tick data.
(2) Resamples the tick data to the defined bar length.
(3) Picks the relevant feature values for all lags ...
(4) ... and reshapes them to a form that the model can use for prediction.
(5) Generates the prediction value (either +1 or -1).
(6) The conditions to enter (or keep) a long position.
(7) The conditions to enter (or keep) a short position.
(8) The condition to stop trading and close out any open positions (arbitrarily
    defined based on the number of ticks retrieved).
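

The order sizes in the callback function follow from the current position: entering from a neutral position trades the base amount, while reversing a position trades twice the base amount. The following is a minimal sketch that isolates this arithmetic without any API calls; the function `next_order` is a hypothetical helper and not part of the chapter's code or of `fxcmpy`:

```python
def next_order(position, signal, amount):
    ''' Returns a tuple (side, units, new_position) or None if no trade
    is required. position is -1 (short), 0 (neutral), or +1 (long);
    signal is +1 or -1.
    '''
    if position in [0, -1] and signal == 1:
        return 'buy', amount - position * amount, 1
    elif position in [0, 1] and signal == -1:
        return 'sell', amount + position * amount, -1
    return None  # position and signal already agree


amount = 100
for position, signal in [(0, 1), (-1, 1), (1, 1),
                         (0, -1), (1, -1), (-1, -1)]:
    print(position, signal, next_order(position, signal, amount))
```

Separating the decision from the order placement in this way also makes the trading logic easy to test before it is connected to a live account.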


Infrastructure and Deployment 


Deploying an automated algorithmic trading strategy with real funds requires an 
appropriate infrastructure. Among others, the infrastructure should satisfy the fol- 
lowing conditions: 


Reliability 
The infrastructure on which to deploy an algorithmic trading strategy should 
allow for high availability (e.g., > 99.9%) and should otherwise take care of relia- 
bility (automatic backups, redundancy of drives and web connections, etc.). 


Performance
Depending on the amount of data being processed and the computational 
demand the algorithms generate, the infrastructure must have enough CPU 
cores, working memory (RAM), and storage (SSD); in addition, the web connec- 
tions should be sufficiently fast. 


Security 
The operating system and the applications run on it should be protected by 
strong passwords as well as SSL encryption; the hardware should be protected 
from fire, water, and unauthorized physical access. 


Basically, these requirements can only be fulfilled by renting appropriate infrastruc- 
ture from a professional data center or a cloud provider. Investments in the physical 
infrastructure to satisfy the aforementioned requirements can in general only be jus- 
tified by the bigger or even biggest players in the financial markets. 


From a development and testing point of view, even the smallest Droplet (cloud 
instance) from DigitalOcean is enough to get started. At the time of this writing such 
a Droplet costs 5 USD per month; usage is billed by the hour and a server can be
created within minutes and destroyed within seconds.⁴


How to set up a Droplet with DigitalOcean is explained in detail in the section “Using 
Cloud Instances” on page 50, with bash scripts that can be adjusted to reflect individ- 
ual requirements regarding Python packages, for example. 


Operational Risks 


Although the development and testing of automated algorithmic 
trading strategies is possible from a local computer (desktop, note- 
book, etc.), it is not appropriate for the deployment of live strate- 
gies trading real money. A simple loss of the web connection or a 
brief power outage might bring down the whole algorithm, leaving, 
for example, unintended open positions in the portfolio or causing 
data set corruption (due to missing out on real-time tick data), 
potentially leading to wrong signals and unintended trades/ 
positions. 


Logging and Monitoring 


Let’s assume that the automated algorithmic trading strategy is to be deployed on a
remote server (cloud instance, leased server, etc.), that all required Python packages
have been installed (see “Using Cloud Instances” on page 50), and that, for instance,
Jupyter Notebook is running securely. What else needs to be considered from the
algorithmic trader’s point of view if they do not want to sit all day in front of the
screen while logged in to the server?


4 Use the link http://bit.ly/do_sign_up to get a 10 USD bonus on DigitalOcean when signing up for a new
account.


This section addresses two important topics in this regard: logging and real-time
monitoring. Logging persists information and events on disk for later inspection. It is
standard practice in software application development and deployment. Here, however,
the focus is on the financial side: logging important financial data and event
information for later inspection and analysis. The same holds true for real-time
monitoring via socket communication. Sockets make it possible to create a constant
real-time stream of important financial information that can be retrieved and
processed on a local computer, even if the deployment happens in the cloud.


“Automated Trading Strategy” on page 550 presents a Python script implementing all 
these aspects and making use of the code from “Online Algorithm” on page 544. The 
script puts the code in a shape that allows, for example, the deployment of the algo- 
rithmic trading strategy—based on the persisted algorithm object—on a remote 
server. It adds both logging and monitoring capabilities based on a custom function 
that, among other things, makes use of ZeroMQ for socket communication. In combination
with the short script from “Strategy Monitoring” on page 553, this allows for remote 
real-time monitoring of the activity on a remote server. 
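

The script referenced above relies on a custom function for logging. As an alternative illustration of the logging part only, the following minimal sketch uses Python's standard `logging` module instead; the logger name, the log file location, and the helper `log_cycle` are hypothetical, and no socket communication is involved:

```python
import logging
import os
import tempfile

# hypothetical log file location in a temporary directory
log_path = os.path.join(tempfile.mkdtemp(), 'strategy.log')

logger = logging.getLogger('strategy_sketch')
logger.setLevel(logging.INFO)
handler = logging.FileHandler(log_path)
handler.setFormatter(logging.Formatter('%(asctime)s | %(message)s'))
logger.addHandler(handler)


def log_cycle(features, position, signal):
    ''' Logs the inputs and the outcome of a single trading cycle. '''
    logger.info('features: %s | position: %d | signal: %d',
                features, position, signal)
    if position == signal:
        logger.info('no trade placed')
    else:
        logger.info('going %s', 'LONG' if signal == 1 else 'SHORT')


log_cycle([1, -1, 1, -1, -1], -1, -1)
log_cycle([-1, 1, -1, -1, 1], -1, 1)
handler.flush()

with open(log_path) as f:
    lines = f.read().splitlines()
print('\n'.join(lines))
```

Each record carries a time stamp, so the sequence of signals and trades can be reconstructed later from the log file alone.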


When the script from “Automated Trading Strategy” on page 550 is run, either locally 
or remotely, the output that is logged and sent via the socket looks as follows: 


2018-07-25 09:16:15.568208

MOST RECENT DATA
                          Mid   Returns  Direction
2018-07-25 07:15:30  1.168885 -0.000009         -1
2018-07-25 07:15:45  1.168945  0.000043          1
2018-07-25 07:16:00  1.168895 -0.000051         -1
2018-07-25 07:16:15  1.168895 -0.000009         -1
2018-07-25 07:16:30  1.168885 -0.000017         -1

features: [[ 1 -1  1 -1 -1]]
position: -1
signal: -1

2018-07-25 09:16:15.581453

no trade placed

****END OF CYCLE***

2018-07-25 09:16:30.069737

MOST RECENT DATA
                          Mid   Returns  Direction
2018-07-25 07:15:45  1.168945  0.000043          1
2018-07-25 07:16:00  1.168895 -0.000051         -1
2018-07-25 07:16:15  1.168895 -0.000009         -1
2018-07-25 07:16:30  1.168950  0.000034          1
2018-07-25 07:16:45  1.168945 -0.000017         -1

features: [[-1  1 -1 -1  1]]
position: -1
signal: 1

2018-07-25 09:16:33.035094

Going LONG.

    tradeId  amountK currency  grossPL  isBuy
0  61476318      100  EUR/USD       -2   True

****END OF CYCLE***


Running the script from “Strategy Monitoring” on page 553 locally then allows the
real-time retrieval and processing of such information. Of course, it is easy to adjust
the logging and streaming data to one’s own requirements.⁵ Similarly, one can also,
for example, persist DataFrame objects as created during the execution of the trading
script. Furthermore, the trading script and the whole logic can be adjusted to include
such elements as stop losses or take profit targets programmatically. Alternatively,
one could make use of more sophisticated order types available via the FXCM trading
API.


5 Note that the socket communication as implemented in the two scripts is not encrypted and is sending plain 
text over the web, which might represent a security risk in production. 


Consider All Risks


Trading currency pairs and/or CFDs is associated with a number of 
financial risks. Implementing an algorithmic trading strategy for 
such instruments automatically leads to a number of additional 
risks. Among them are flaws in the trading and/or execution logic,
as well as technical risks such as problems with socket communica- 
tions or delayed retrieval or even loss of tick data during the 
deployment. Therefore, before one deploys a trading strategy in 
automated fashion one should make sure that all associated mar- 
ket, execution, operational, technical, and other risks have been 
identified, evaluated, and addressed. The code presented in this 
chapter is intended only for technical illustration purposes. 


Conclusion 


This chapter is about the deployment of an algorithmic trading strategy—based on a 
classification algorithm from machine learning to predict the direction of market 
movements—in automated fashion. It addresses such important topics as capital 
management (based on the Kelly criterion), vectorized backtesting for performance 
and risk, the transformation of offline to online trading algorithms, an appropriate 
infrastructure for deployment, as well as logging and monitoring during deployment. 


The topic of this chapter is complex and requires a broad skill set from the algorith- 
mic trading practitioner. On the other hand, having a REST API for algorithmic trad- 
ing available, such as the one from FXCM, simplifies the automation task 
considerably since the core part boils down mainly to making use of the capabilities 
of the Python wrapper package fxcmpy for tick data retrieval and order placement. 
Around this core, elements to mitigate operational and technical risks as far as possi- 
ble have to be added. 


Python Scripts 
Automated Trading Strategy 


The following is the Python script to implement the algorithmic trading strategy in 
automated fashion, including logging and monitoring. 


#
# Automated ML-Based Trading Strategy for FXCM
# Online Algorithm, Logging, Monitoring
#
# Python for Finance, 2nd ed.
# (c) Dr. Yves J. Hilpisch
#
import zmq
import time
import pickle
import fxcmpy
import numpy as np
import pandas as pd
import datetime as dt

sel = ['tradeId', 'amountK', 'currency',
       'grossPL', 'isBuy']

log_file = 'automated_strategy.log'

# loads the persisted algorithm object
algorithm = pickle.load(open('algorithm.pkl', 'rb'))

# sets up the socket communication via ZeroMQ (here: "publisher")
context = zmq.Context()
socket = context.socket(zmq.PUB)

# this binds the socket communication to all IP addresses of the machine
socket.bind('tcp://0.0.0.0:5555')


def logger_monitor(message, time=True, sep=True):
    ''' Custom logger and monitor function.
    '''
    with open(log_file, 'a') as f:
        t = str(dt.datetime.now())
        msg = ''
        if time:
            msg += '\n' + t + '\n'
        if sep:
            msg += 66 * '=' + '\n'
        msg += message + '\n\n'
        # sends the message via the socket
        socket.send_string(msg)
        # writes the message to the log file
        f.write(msg)


def report_positions(pos):
    ''' Prints, logs and sends position data.
    '''
    out = '\n\n' + 50 * '=' + '\n'
    out += 'Going {}.\n'.format(pos) + '\n'
    time.sleep(2)  # waits for the order to be executed
    out += str(api.get_open_positions()[sel]) + '\n'
    out += 50 * '=' + '\n'
    logger_monitor(out)
    print(out)


def automated_strategy(data, dataframe):
    ''' Callback function embodying the trading logic.
    '''
    global min_bars, position, df
    # resampling of the tick data
    df = dataframe.resample(bar, label='right').last().ffill()

    if len(df) > min_bars:
        min_bars = len(df)
        logger_monitor('NUMBER OF TICKS: {} | '.format(len(dataframe)) +
                       'NUMBER OF BARS: {}'.format(min_bars))
        # data processing and feature preparation
        df['Mid'] = df[['Bid', 'Ask']].mean(axis=1)
        df['Returns'] = np.log(df['Mid'] / df['Mid'].shift(1))
        df['Direction'] = np.where(df['Returns'] > 0, 1, -1)
        # picks relevant points
        features = df['Direction'].iloc[-(lags + 1):-1]
        # necessary reshaping
        features = features.values.reshape(1, -1)
        # generates the signal (+1 or -1)
        signal = algorithm.predict(features)[0]

        # logs and sends major financial information
        logger_monitor('MOST RECENT DATA\n' +
                       str(df[['Mid', 'Returns', 'Direction']].tail()),
                       False)
        logger_monitor('features: ' + str(features) + '\n' +
                       'position: ' + str(position) + '\n' +
                       'signal: ' + str(signal), False)

        # trading logic
        if position in [0, -1] and signal == 1:  # going long?
            api.create_market_buy_order(
                symbol, size - position * size)  # places a buy order
            position = 1  # changes position to long
            report_positions('LONG')

        elif position in [0, 1] and signal == -1:  # going short?
            api.create_market_sell_order(
                symbol, size + position * size)  # places a sell order
            position = -1  # changes position to short
            report_positions('SHORT')
        else:  # no trade
            logger_monitor('no trade placed')

        logger_monitor('****END OF CYCLE***\n\n', False, False)

    if len(dataframe) > 350:  # stopping condition
        api.unsubscribe_market_data('EUR/USD')  # unsubscribes from data stream
        report_positions('CLOSE OUT')
        api.close_all()  # closes all open positions
        logger_monitor('***CLOSING OUT ALL POSITIONS***')


if __name__ == '__main__':
    symbol = 'EUR/USD'  # symbol to be traded
    bar = '15s'  # bar length; adjust for testing and deployment
    size = 100  # position size in thousand currency units
    position = 0  # initial position
    lags = 5  # number of lags for features data
    min_bars = lags + 1  # minimum length for resampled DataFrame
    df = pd.DataFrame()
    # adjust configuration file location
    api = fxcmpy.fxcmpy(config_file='../fxcm.cfg')
    # the main asynchronous loop using the callback function
    api.subscribe_market_data(symbol, (automated_strategy,))


Strategy Monitoring 


The following is the Python script to implement a local or remote monitoring of the 
automated algorithmic trading strategy via socket communication. 


#
# Automated ML-Based Trading Strategy for FXCM
# Strategy Monitoring via Socket Communication
#
# Python for Finance, 2nd ed.
# (c) Dr. Yves J. Hilpisch
#
import zmq

# sets up the socket communication via ZeroMQ (here: "subscriber")
context = zmq.Context()
socket = context.socket(zmq.SUB)

# adjust the IP address to reflect the remote location
socket.connect('tcp://REMOTE_IP_ADDRESS:5555')

# configures the socket to retrieve every message
socket.setsockopt_string(zmq.SUBSCRIBE, '')

while True:
    msg = socket.recv_string()
    print(msg)


Further Resources


The papers cited in this chapter are:

• Rotando, Louis, and Edward Thorp (1992). “The Kelly Criterion and the Stock
  Market.” The American Mathematical Monthly, Vol. 99, No. 10, pp. 922-931.
• Hung, Jane (2010). “Betting with the Kelly Criterion.”
  http://bit.ly/betting_with_kelly.


For a comprehensive online training program covering Python for algorithmic
trading see http://certificate.tpq.io.


PART V
Derivatives Analytics 


This part of the book is concerned with the development of a smaller, but neverthe- 
less still powerful, real-world application for the pricing of options and derivatives by 
Monte Carlo simulation.¹ The goal is to have, in the end, a set of Python classes—a
pricing library called DX, for Derivatives analytiX—that allows for the following: 


Modeling 
To model short rates for discounting purposes; to model European and Ameri- 
can options, including their underlying risk factors as well as their relevant mar- 
ket environments; to model even complex portfolios consisting of multiple 
options with multiple (possibly correlated) underlying risk factors 


Simulation 
To simulate risk factors based on geometric Brownian motion and jump diffu- 
sions as well as on square-root diffusions, and to simulate a number of such risk 
factors simultaneously and consistently, whether they are correlated or not 


Valuation 
To value, by the risk-neutral valuation approach, European and American 
options with arbitrary payoffs; to value portfolios composed of such options in a 
consistent, integrated fashion (“global valuation”) 


1 See Bittman (2009) for an introduction to options trading and related topics like market fundamentals and 
the role of the so-called Greeks in options risk management. 


Risk management 
To estimate numerically the most important Greeks—i.e., the delta and the vega 
of an option/derivative—independent of the underlying risk factor or the exer- 
cise type 


Application 
To use the package to value and manage a portfolio of non-traded American 
options on the DAX 30 stock index in market-consistent fashion; i.e., based on a 
calibrated model for the DAX 30 index 


The material presented in this part of the book relies on the DX analytics package, 
which is developed and maintained by the author and The Python Quants GmbH 
(and available, e.g., via the Quant Platform). The full-fledged version allows, for 
instance, the modeling, pricing, and risk management of complex multi-risk deriva- 
tives and trading books composed thereof. 


This part is divided into the following chapters: 


Chapter 17 presents the valuation framework in both theoretical and technical 
form. Theoretically, the Fundamental Theorem of Asset Pricing and the risk- 
neutral valuation approach are central. Technically, the chapter presents Python 
classes for risk-neutral discounting and for market environments. 


Chapter 18 is concerned with the simulation of risk factors based on geometric 
Brownian motion, jump diffusions, and square-root diffusion processes; a 
generic class and three specialized classes are discussed. 


Chapter 19 addresses the valuation of single derivatives with European or Ameri- 
can exercise based on a single underlying risk factor; again, a generic and two 
specialized classes represent the major building blocks. The generic class allows 
the estimation of the delta and the vega independent of the option type. 


Chapter 20 is about the valuation of possibly complex derivatives portfolios with 
multiple derivatives based on multiple possibly correlated underlyings; a simple 
class for the modeling of a derivatives position is presented as well as a more 
complex class for a consistent portfolio valuation. 


Chapter 21 uses the DX library developed in the other chapters to value and risk- 
manage a portfolio of American put options on the DAX 30 stock index. 


CHAPTER 17 
Valuation Framework 


Compound interest is the greatest mathematical discovery of all time. 


—Albert Einstein 


This chapter provides the framework for the development of the DX library by intro- 
ducing the most fundamental concepts needed for such an undertaking. It briefly 
reviews the Fundamental Theorem of Asset Pricing, which provides the theoretical 
background for the simulation and valuation. It then proceeds by addressing the fun- 
damental concepts of date handling and risk-neutral discounting. This chapter con- 
siders only the simplest case of constant short rates for the discounting, but more 
complex and realistic models can be added to the library quite easily. This chapter 
also introduces the concept of a market environment—i.e., a collection of constants, 
lists, and curves needed for the instantiation of almost any other class to come in sub- 
sequent chapters. 


The chapter comprises the following sections: 


“Fundamental Theorem of Asset Pricing” on page 558 
This section introduces the Fundamental Theorem of Asset Pricing, which pro- 
vides the theoretical background for the library to be developed. 


“Risk-Neutral Discounting” on page 560 
This section develops a class for the risk-neutral discounting of future payoffs of 
options and other derivative instruments. 


“Market Environments” on page 565 
This section develops a class to manage market environments for the pricing of 
single instruments and portfolios composed of multiple instruments. 


Fundamental Theorem of Asset Pricing


The Fundamental Theorem of Asset Pricing is one of the cornerstones and success 
stories of modern financial theory and mathematics.¹ The central notion underlying
the theorem is the concept of a martingale measure; i.e., a probability measure that 
removes the drift from a discounted risk factor (stochastic process). In other words, 
under a martingale measure, all risk factors drift with the risk-free short rate—and 
not with any other market rate involving some kind of risk premium over the risk- 
free short rate. 


A Simple Example 


Consider a simple economy at the dates today and tomorrow with a risky asset, a
“stock,” and a riskless asset, a “bond.” The bond costs 10 USD today and pays off 10
USD tomorrow (zero interest rates). The stock costs 10 USD today and, with a
probability of 60% and 40%, respectively, pays off 20 USD or 0 USD tomorrow. The riskless
return of the bond is 0. The expected return of the stock is
$\frac{0.6 \cdot 20 + 0.4 \cdot 0}{10} - 1 = 0.2$, or 20%. This is the risk premium
the stock pays for its riskiness.


Consider now a call option with strike price of 15 USD. What is the fair value of such 
a contingent claim that pays 5 USD with 60% probability and 0 USD otherwise? One 
can take the expectation, for example, and discount the resulting value back (here 
with zero interest rates). This approach yields a value of $0.6 \cdot 5 = 3$ USD, since the
option pays 5 USD in the case where the stock price moves up to 20 USD and 0 USD 
otherwise. 


However, there is another approach that has been successfully applied to option pric- 
ing problems like this: replication of the option’s payoff through a portfolio of traded 
securities. It is easily verified that buying 0.25 of the stock perfectly replicates the 
option’s payoff (in the 60% case one then has $0.25 \cdot 20 = 5$ USD). A quarter of the
stock only costs 2.5 USD and not 3 USD. Taking expectations under the real-world 
probability measure overvalues the option. 


Why is this the case? The real-world measure implies a risk premium of 20% for the
stock since the risk involved in the stock (gaining 100% or losing 100%) is “real” in
the sense that it cannot be diversified or hedged away. On the other hand, there is a
portfolio available that replicates the option’s payoff without any risk. This also
implies that someone writing (selling) such an option can completely hedge away any
risk.² Such a perfectly hedged portfolio of an option and a hedge position must yield
the riskless rate in order to avoid arbitrage opportunities (i.e., the opportunity to
make some money out of no money with a positive probability).


1 Refer to Delbaen and Schachermayer (2004) for a comprehensive review and details of the mathematical
machinery involved. See also Chapter 4 of Hilpisch (2015) for a shorter introduction, in particular for the
discrete time version.


Can one save the approach of taking expectations to value the call option? Yes, it is
possible. One “only” has to change the probability in such a way that the risky asset,
the stock, drifts with the riskless short rate of zero. Obviously, a (martingale) measure
giving equal mass of 50% to both scenarios accomplishes this; the calculation is
$\frac{0.5 \cdot 20 + 0.5 \cdot 0}{10} - 1 = 0$. Now, taking expectations of the
option’s payoff under the new martingale measure yields the correct (arbitrage-free)
fair value: $0.5 \cdot 5 + 0.5 \cdot 0 = 2.5$ USD.
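

The three valuation approaches of the simple example can be checked numerically. The following is a minimal sketch using the numbers from the text; the variable names are illustrative:

```python
import numpy as np

S0 = 10.0  # stock price today
S1 = np.array([20.0, 0.0])  # stock payoffs tomorrow (up, down)
K = 15.0  # strike price of the call option
C1 = np.maximum(S1 - K, 0)  # option payoffs tomorrow
P = np.array([0.6, 0.4])  # real-world probabilities

# expectation under the real-world measure (zero interest rate)
value_P = np.dot(P, C1)

# replication with the stock alone (possible here since the option
# and the stock both pay off 0 in the down scenario)
b = (C1[0] - C1[1]) / (S1[0] - S1[1])  # number of stocks to hold
value_replication = b * S0

# martingale probability q solves q * 20 + (1 - q) * 0 = 10
q = (S0 - S1[1]) / (S1[0] - S1[1])
Q = np.array([q, 1 - q])
value_Q = np.dot(Q, C1)

print(value_P, value_replication, value_Q)
```

The expectation under the real-world probabilities gives the 3 USD from the text, while both the replication portfolio and the expectation under the martingale measure give 2.5 USD.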


The General Results 


The beauty of this approach is that it carries over to even the most complex econo- 
mies with, for example, continuous time modeling (i.e., a continuum of points in 
time to consider), large numbers of risky assets, complex derivative payoffs, etc. 


Therefore, consider a general market model in discrete time:³


A general market model $\mathcal{M}$ in discrete time is a collection of:


• A finite state space $\Omega$
• A filtration $\mathbb{F}$
• A strictly positive probability measure $P$ defined on $\wp(\Omega)$
• A terminal date $T \in \mathbb{N}$, $T < \infty$
• A set $\mathbb{S} \equiv \{(S_t^k)_{t \in \{0, \ldots, T\}} : k \in \{0, \ldots, K\}\}$ of $K + 1$ strictly positive security price
  processes

Together one has $\mathcal{M} = \{(\Omega, \wp(\Omega), \mathbb{F}, P), T, \mathbb{S}\}$.


Based on such a general market model, one can formulate the Fundamental Theorem 
of Asset Pricing as follows:⁴


2 The strategy would involve selling an option at a price of 2.5 USD and buying 0.25 stocks for 2.5 USD. The 
payoff of such a portfolio is 0 no matter what scenario plays out in the simple economy. 


3 See Williams (1991) on the probabilistic concepts. 
4 See Delbaen and Schachermayer (2004). 
Consider the general market model $\mathcal{M}$. According to the Fundamental Theorem of
Asset Pricing, the following three statements are equivalent:


• There are no arbitrage opportunities in the market model $\mathcal{M}$.
• The set $\mathbb{Q}$ of $P$-equivalent martingale measures is nonempty.
• The set $\mathbb{P}$ of consistent linear price systems is nonempty.


When it comes to valuation and pricing of contingent claims (i.e., options, deriva- 
tives, futures, forwards, swaps, etc.), the importance of the theorem is illustrated by 
the following corollary: 


If the market model $\mathcal{M}$ is arbitrage-free, then there exists a unique price $V_0$ associated
with any attainable (i.e., replicable) contingent claim (option, derivative, etc.) $V_T$. It
satisfies $\forall Q \in \mathbb{Q}: V_0 = \mathbf{E}_0^Q\left(e^{-rT} V_T\right)$, where $e^{-rT}$ is the relevant risk-neutral discount
factor for a constant short rate $r$.


This result illustrates the importance of the theorem, and shows that our simple rea- 
soning from earlier indeed carries over to the general market model. 
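

To see the discount factor at work, the simple economy from earlier can be extended to a positive short rate. The following is a minimal sketch for a one-period model with two states; the function `risk_neutral_value` and the rate of 5% are assumptions made for illustration:

```python
import numpy as np


def risk_neutral_value(S0, S_up, S_down, payoff, r, T=1.0):
    ''' Values a claim in a one-period, two-state model by risk-neutral
    discounting: V_0 = exp(-r * T) * E^Q(V_T).
    '''
    # q is chosen such that the discounted stock price is a martingale
    q = (S0 * np.exp(r * T) - S_down) / (S_up - S_down)
    if not 0 < q < 1:
        raise ValueError('no equivalent martingale measure exists')
    V_up, V_down = payoff(S_up), payoff(S_down)
    return np.exp(-r * T) * (q * V_up + (1 - q) * V_down)


def call(S):
    return max(S - 15.0, 0.0)


def put(S):
    return max(15.0 - S, 0.0)


for r in [0.0, 0.05]:
    print(r, risk_neutral_value(10.0, 20.0, 0.0, call, r),
          risk_neutral_value(10.0, 20.0, 0.0, put, r))
```

The call value stays at 2.5 USD for both rates because the option is replicated by 0.25 of the stock alone. The put value, in contrast, depends on the short rate, since its replication involves a position in the riskless asset.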


Due to the role of the martingale measure, this approach to valuation is also often 
called the martingale approach, or—since under the martingale measure all risky 
assets drift with the riskless short rate—the risk-neutral valuation approach. The sec- 
ond term might, for our purposes, be the better one because in numerical applica- 
tions, one “simply” lets the risk factors (stochastic processes) drift by the risk-neutral 
short rate. One does not have to deal with the probability measures directly for our 
applications—they are, however, what theoretically justifies the central theoretical 
results applied and the technical approach implemented. 


Finally, consider market completeness in the general market model: 


The market model $\mathcal{M}$ is complete if it is arbitrage-free and if every contingent claim
(option, derivative, etc.) is attainable (i.e., replicable).


Suppose that the market model $\mathcal{M}$ is arbitrage-free. The market model is complete if
and only if $\mathbb{Q}$ is a singleton; i.e., if there is a unique $P$-equivalent martingale measure.


This mainly completes the discussion of the theoretical background for what follows.
For a detailed exposition of the concepts, notions, definitions, and results, refer to
Chapter 4 of Hilpisch (2015).


Risk-Neutral Discounting 


Obviously, risk-neutral discounting is central to the risk-neutral valuation approach. 
This section therefore develops a Python class for risk-neutral discounting. However, 
it pays to first have a closer look at the modeling and handling of relevant dates for a 
valuation. 


Modeling and Handling Dates


A necessary prerequisite for discounting is the modeling of dates (see also Appen- 
dix A). For valuation purposes, one typically divides the time interval between today 
and the final date of the general market model T into discrete time intervals. These 
time intervals can be homogeneous (i.e., of equal length), or they can be heterogene- 
ous (i.e., of varying length). A valuation library should be able to handle the more 
general case of heterogeneous time intervals, since the simpler case is then automati- 
cally included. Therefore, the code works with lists of dates, assuming that the small- 
est relevant time interval is one day. This implies that intraday events are considered 
irrelevant, for which one would have to model time (in addition to dates).⁵


To compile a list of relevant dates, one can basically take one of two approaches: con- 
structing a list of concrete dates (e.g., as datetime objects in Python) or of year frac- 
tions (as decimal numbers, as is often done in theoretical works). 


Some imports first: 


In [1]: import numpy as np
        import pandas as pd
        import datetime as dt


In [2]: from pylab import mpl, plt
        plt.style.use('seaborn')
        mpl.rcParams['font.family'] = 'serif'
        %matplotlib inline


In [3]: import sys
        sys.path.append('../dx')


For example, the following two definitions of dates and fractions are (roughly) 
equivalent: 


In [4]: dates = [dt.datetime(2020, 1, 1), dt.datetime(2020, 7, 1), 
dt.datetime(2021, 1, 1)] 


In [5]: (dates[1] - dates[0]).days / 365. 
Out[5]: 0.4986301369863014 


In [6]: (dates[2] - dates[1]).days / 365. 
Out[6]: 0.5041095890410959 


In [7]: fractions = [0.0, 0.5, 1.0] 


5 Adding a time component is actually a straightforward undertaking, which is nevertheless not done here for 
the ease of the exposition.


They are only roughly equivalent since year fractions seldom fall on the beginning
(midnight) of a given day. Just consider the result of dividing a year by 50.
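
The point can be checked with a few lines of Python. The following is only a small sketch; the variable names are illustrative and a 365-day year is assumed. It converts the year fraction 1/50 back into a calendar date:

```python
import datetime as dt

start = dt.datetime(2020, 1, 1)
fraction = 1 / 50            # one fiftieth of a year
days = fraction * 365.       # number of days, assuming 365 days per year
date = start + dt.timedelta(days=days)

print(days)   # not a whole number of days
print(date)   # a point in time during the day, not midnight
```

The resulting date carries a time component, which is exactly what a date-only grid cannot represent.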


Sometimes it is necessary to get year fractions out of a list of dates. The function 
get_year_deltas() does the job: 


#
# DX Package
#
# Frame -- Helper Function
#
# get_year_deltas.py
#
# Python for Finance, 2nd ed.
# (c) Dr. Yves J. Hilpisch
#
import numpy as np


def get_year_deltas(date_list, day_count=365.):
    ''' Return vector of floats with day deltas in year fractions.
    Initial value normalized to zero.

    Parameters

    date_list: list or array
        collection of datetime objects
    day_count: float
        number of days for a year
        (to account for different conventions)

    Results

    delta_list: array
        year fractions
    '''

    start = date_list[0]
    delta_list = [(date - start).days / day_count
                  for date in date_list]
    return np.array(delta_list)


This function can then be applied as follows: 


In [8]: from get_year_deltas import get_year_deltas 


In [9]: get_year_deltas(dates) 
Out[9]: array([0. , 0.49863014, 1.00273973]) 
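
The day_count parameter exists to accommodate different day count conventions. As a hedged illustration, the following sketch repeats the logic of the function under the hypothetical name year_deltas() (so that the example is self-contained) and compares a 365-day with a 360-day year:

```python
import datetime as dt
import numpy as np


def year_deltas(date_list, day_count=365.):
    # same logic as get_year_deltas() in the text, repeated here
    # only to keep the example self-contained
    start = date_list[0]
    return np.array([(d - start).days / day_count for d in date_list])


dates = [dt.datetime(2020, 1, 1), dt.datetime(2020, 7, 1),
         dt.datetime(2021, 1, 1)]

act_365 = year_deltas(dates)                   # 365 days per year
act_360 = year_deltas(dates, day_count=360.)   # 360 days per year

print(act_365)
print(act_360)
```

Since 2020 is a leap year, the two dates after the start lie 182 and 366 days away; the year fractions differ only through the divisor.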


When modeling the short rate, it becomes clear what the benefit of this conversion is. 




Constant Short Rate 


The exposition to follow focuses on the simplest case for discounting by the short
rate; namely, the case where the short rate is constant through time. Many option
pricing models, like the ones of Black-Scholes-Merton (1973), Merton (1976), or
Cox-Ross-Rubinstein (1979), make this assumption.⁶ Assume continuous discounting,
as is usual for option pricing applications. In such a case, the general discount
factor as of today, given a future date t and a constant short rate of r, is then given by
$D_0(t) = e^{-rt}$. Of course, for the end of the economy the special case $D_0(T) = e^{-rT}$
holds true. Note that here both t and T are in year fractions.


The discount factors can also be interpreted as the value of a unit zero-coupon bond
(ZCB) as of today, maturing at t and T, respectively.⁷ Given two dates t > s > 0, the
discount factor relevant for discounting from t to s is then given by the equation
$D_s(t) = D_0(t) / D_0(s) = e^{-rt} / e^{-rs} = e^{-rt} \cdot e^{rs} = e^{-r(t-s)}$.
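
A short numerical check may help here. The sketch below uses illustrative values for r, s, and t and a hypothetical helper D0(); it evaluates the discount factors with NumPy and confirms that the ratio of the two factors as of today equals the factor for the interval from s to t:

```python
import numpy as np

r = 0.05          # constant short rate (illustrative value)
s, t = 0.5, 1.0   # year fractions with t > s > 0


def D0(x):
    ''' Discount factor as of today for year fraction x. '''
    return np.exp(-r * x)


D_s_t = D0(t) / D0(s)   # discount factor from t back to s

print(D0(s), D0(t), D_s_t)
```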


The following translates these considerations into Python code in the form of a class:⁸


#
# DX Library
#
# Frame -- Constant Short Rate Class
#
# constant_short_rate.py
#
# Python for Finance, 2nd ed.
# (c) Dr. Yves J. Hilpisch
#
import numpy as np

from get_year_deltas import *


class constant_short_rate(object):
    ''' Class for constant short rate discounting.

    Attributes

    name: string
        name of the object
    short_rate: float (positive)
        constant rate for discounting

    Methods

    get_discount_factors:
        get discount factors given a list/array of datetime objects
        or year fractions
    '''

    def __init__(self, name, short_rate):
        self.name = name
        self.short_rate = short_rate
        if short_rate < 0:
            raise ValueError('Short rate negative.')
            # this is debatable given recent market realities

    def get_discount_factors(self, date_list, dtobjects=True):
        if dtobjects is True:
            dlist = get_year_deltas(date_list)
        else:
            dlist = np.array(date_list)
        dflist = np.exp(self.short_rate * np.sort(-dlist))
        return np.array((date_list, dflist)).T


6 For the pricing of, for example, short-dated options, this assumption seems satisfied in many circumstances.


7 A unit zero-coupon bond pays exactly one currency unit at its maturity and no coupons between today and
maturity.


8 See Chapter 6 for the basics of object-oriented programming (OOP) in Python. Here, and for the rest of this
part, the naming deviates from the standard PEP 8 conventions with regard to Python class names. PEP 8
recommends using “CapWords” or “CamelCase” convention in general for Python class names. The code in
this part rather uses the function name convention as mentioned in PEP 8 as a valid alternative “in cases
where the interface is documented and used primarily as a callable.”


The application of the class dx.constant_short_rate is best illustrated by a simple,
concrete example. The main result is a two-dimensional ndarray object containing 
pairs of a datetime object and the relevant discount factor. The class in general and 
the object csr in particular work with year fractions as well: 


In [10]: from constant_short_rate import constant_short_rate


In [11]: csr = constant_short_rate('csr', 0.05)


In [12]: csr.get_discount_factors(dates)
Out[12]: array([[datetime.datetime(2020, 1, 1, 0, 0), 0.9510991280247174],
                [datetime.datetime(2020, 7, 1, 0, 0), 0.9753767163648953],
                [datetime.datetime(2021, 1, 1, 0, 0), 1.0]], dtype=object)


In [13]: deltas = get_year_deltas(dates)
         deltas
Out[13]: array([0.        , 0.49863014, 1.00273973])


In [14]: csr.get_discount_factors(deltas, dtobjects=False)
Out[14]: array([[0.        , 0.95109913],
                [0.49863014, 0.97537672],
                [1.00273973, 1.        ]])
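

The ordering of the factors in this output deserves a second look. The following sketch uses plain NumPy and only the expression found in get_discount_factors(); it is a reading of the code as printed, not a statement about design intent. It reproduces the numbers and makes the ordering explicit: because the negated year fractions are sorted in ascending order, the factor for the longest period comes first and the factor 1.0 comes last.

```python
import numpy as np

r = 0.05
# year fractions for 2020-01-01, 2020-07-01, and 2021-01-01 (365-day year)
deltas = np.array([0.0, 182 / 365., 366 / 365.])

# same expression as in get_discount_factors():
# negate, sort in ascending order, exponentiate
dflist = np.exp(r * np.sort(-deltas))
pairs = np.array((deltas, dflist)).T

print(pairs)
```

The first row therefore pairs the first date with the factor for discounting over the whole horizon.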


This class will take care of all discounting operations needed in other classes.


Market Environments


A market environment is “just” a name for a collection of other data and Python
objects. However, it is rather convenient to work with this abstraction since it
simplifies a number of operations and also allows for a consistent modeling of recurring
aspects.⁹ A market environment mainly consists of three dictionaries to store the
following types of data and Python objects:


Constants 
These can be, for example, model parameters or option maturity dates. 


Lists 
These are collections of objects in general, like a list of objects modeling (risky) 
securities. 


Curves 
These are objects for discounting; e.g., an instance of the
dx.constant_short_rate class.


Following is the code for the dx.market_environment class. Refer to Chapter 3 for 
details on the handling of dict objects: 


#
# DX Package
#
# Frame -- Market Environment Class
#
# market_environment.py
#
# Python for Finance, 2nd ed.
# (c) Dr. Yves J. Hilpisch
#


class market_environment(object):
    ''' Class to model a market environment relevant for valuation.

    Attributes

    name: string
        name of the market environment
    pricing_date: datetime object
        date of the market environment

    Methods

    add_constant:
        adds a constant (e.g. model parameter)
    get_constant:
        gets a constant
    add_list:
        adds a list (e.g. underlyings)
    get_list:
        gets a list
    add_curve:
        adds a market curve (e.g. yield curve)
    get_curve:
        gets a market curve
    add_environment:
        adds and overwrites whole market environments
        with constants, lists, and curves
    '''

    def __init__(self, name, pricing_date):
        self.name = name
        self.pricing_date = pricing_date
        self.constants = {}
        self.lists = {}
        self.curves = {}

    def add_constant(self, key, constant):
        self.constants[key] = constant

    def get_constant(self, key):
        return self.constants[key]

    def add_list(self, key, list_object):
        self.lists[key] = list_object

    def get_list(self, key):
        return self.lists[key]

    def add_curve(self, key, curve):
        self.curves[key] = curve

    def get_curve(self, key):
        return self.curves[key]

    def add_environment(self, env):
        # overwrites existing values, if they exist
        self.constants.update(env.constants)
        self.lists.update(env.lists)
        self.curves.update(env.curves)


9 On this concept see also Fletcher and Gardner (2009), who use market environments extensively.


Although there is nothing really special about the dx.market_environment class, a 
simple example shall illustrate how convenient it is to work with instances of the 
class:


In [15]: from market_environment import market_environment

In [16]: me = market_environment('me_gbm', dt.datetime(2020, 1, 1))

In [17]: me.add_constant('initial_value', 36.)

In [18]: me.add_constant('volatility', 0.2)

In [19]: me.add_constant('final_date', dt.datetime(2020, 12, 31))

In [20]: me.add_constant('currency', 'EUR')

In [21]: me.add_constant('frequency', 'M')

In [22]: me.add_constant('paths', 10000)

In [23]: me.add_curve('discount_curve', csr)

In [24]: me.get_constant('volatility')
Out[24]: 0.2

In [25]: me.get_curve('discount_curve').short_rate
Out[25]: 0.05


This illustrates the basic handling of this rather generic “storage” class. For practical
applications, market data and other data as well as Python objects are first collected, 
then a dx.market_environment object is instantiated and filled with the relevant data 
and objects. This is then delivered in a single step to other classes that need the data 
and objects stored in the respective dx.market_environment object. 


A major advantage of this object-oriented modeling approach is, for example, that
instances of the dx.constant_short_rate class can live in multiple environments
(see the topic of aggregation in Chapter 6). Once the instance is updated (for
example, when a new constant short rate is set), all the instances of the
dx.market_environment class containing that particular instance of the discounting
class reflect the change automatically, since each environment holds a reference to
the same object rather than a copy.
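

The mechanism behind this is ordinary Python object sharing. The following sketch uses two hypothetical, stripped-down stand-ins (Curve and Environment, not the DX classes themselves) to show that two environments holding the same curve object both see a changed short rate:

```python
class Curve:
    ''' Hypothetical stand-in for dx.constant_short_rate. '''

    def __init__(self, short_rate):
        self.short_rate = short_rate


class Environment:
    ''' Hypothetical stand-in for dx.market_environment (curves only). '''

    def __init__(self, name):
        self.name = name
        self.curves = {}

    def add_curve(self, key, curve):
        self.curves[key] = curve   # stores a reference, not a copy

    def get_curve(self, key):
        return self.curves[key]


csr = Curve(0.05)
env_1 = Environment('env_1')
env_2 = Environment('env_2')
env_1.add_curve('discount_curve', csr)
env_2.add_curve('discount_curve', csr)

csr.short_rate = 0.06   # update the shared object once

rates = [env.get_curve('discount_curve').short_rate
         for env in (env_1, env_2)]
print(rates)
```

Note that replacing the object itself (for example, adding a new curve under the same key in one environment) would not propagate to the other environment; only changes to the shared object do.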


Flexibility 


The market environment class as introduced in this section is a 
flexible means to model and store any quantities and input data 
relevant to the pricing of options and derivatives and portfolios 
composed thereof. However, this flexibility also leads to opera- 
tional risks in that it is easy to pass nonsensical data, objects, etc. to 
the class during instantiation, which might or might not be cap- 
tured during instantiation. In a production context, a number of 
checks need to be added to at least capture obviously wrong cases.


Conclusion


This chapter provides the basic framework for the larger project of building a Python 
package to value options and other derivatives by Monte Carlo simulation. The chap- 
ter introduces the Fundamental Theorem of Asset Pricing, illustrating it by a rather 
simple numerical example. Important results in this regard are provided for a general 
market model in discrete time. 


The chapter also develops a Python class for risk-neutral discounting purposes to 
make numerical use of the mathematical machinery of the Fundamental Theorem of 
Asset Pricing. Based on a list object of either Python datetime objects or float 
objects representing year fractions, instances of the class dx.constant_short_rate 
provide the appropriate discount factors (present values of unit zero-coupon bonds). 


The chapter concludes with the rather generic dx.market_environment class, which 
allows for the collection of relevant data and Python objects for modeling, simula- 
tion, valuation, and other purposes. 


To simplify future imports, a wrapper module called dx_frame.py is used: 


#
# DX Analytics Package
#
# Frame Functions & Classes
#
# dx_frame.py
#
# Python for Finance, 2nd ed.
# (c) Dr. Yves J. Hilpisch
#
import datetime as dt

from get_year_deltas import get_year_deltas
from constant_short_rate import constant_short_rate
from market_environment import market_environment


A single import statement like the following then makes all framework components 
available in a single step: 


import dx_frame 


Thinking of a Python package of modules, there is also the option to store all relevant 
Python modules in a (sub)folder and to put in that folder a special __init__.py file 
that does all the imports. For example, when storing all modules in a folder called dx, 
say, the file presented next does the job. However, notice the naming convention for 
this particular file:


#
# DX Package
# packaging file
# __init__.py
#

import datetime as dt 


from get_year_deltas import get_year_deltas 
from constant_short_rate import constant_short_rate 
from market_environment import market_environment 


In that case you can just use the folder name to accomplish all the imports at once: 


from dx import * 


Or, via the alternative approach: 


import dx 


Further Resources 


Useful references in book form for the topics covered in this chapter are: 


Bittman, James (2009). Trading Options as a Professional. New York: McGraw 
Hill. 


Delbaen, Freddy, and Walter Schachermayer (2004). The Mathematics of 
Arbitrage. Berlin, Heidelberg: Springer-Verlag. 

Fletcher, Shayne, and Christopher Gardner (2009). Financial Modelling in 
Python. Chichester, England: Wiley Finance. 

Hilpisch, Yves (2015). Derivatives Analytics with Python. Chichester, England: 
Wiley Finance. 


Williams, David (1991). Probability with Martingales. Cambridge, England: 
Cambridge University Press. 


For the original research papers defining the models cited in this chapter, refer to the 
“Further Resources” sections in subsequent chapters.


CHAPTER 18
Simulation of Financial Models 


The purpose of science is not to analyze or describe but to make useful models of the 
world. 


—Edward de Bono 


Chapter 12 introduces in some detail the Monte Carlo simulation of stochastic pro- 
cesses using Python and NumPy. This chapter applies the basic techniques presented 
there to implement simulation classes as a central component of the DX package. The 
set of stochastic processes is restricted to three widely used ones. In particular, the 
chapter comprises the following sections: 


“Random Number Generation” on page 572 
This section develops a function to generate standard normally distributed
random numbers using variance reduction techniques.¹


“Generic Simulation Class” on page 574 
This section develops a generic simulation class from which the other specific
simulation classes inherit fundamental attributes and methods.


“Geometric Brownian Motion” on page 577 
This section is about the geometric Brownian motion (GBM) that was intro- 
duced to the option pricing literature through the seminal works of Black and 
Scholes (1973) and Merton (1973); it is used several times throughout this book 
and still represents—despite its known shortcomings and given the mounting 
empirical evidence against it—a benchmark process for option and derivative 
valuation purposes. 


1 The text speaks of “random” numbers knowing that they are in general “pseudo-random” only.


“Jump Diffusion” on page 582
The jump diffusion, as introduced to finance by Merton (1976), adds a log- 
normally distributed jump component to the GBM. This allows one to take into 
account that, for example, short-term out-of-the-money (OTM) options often 
seem to have priced in the possibility of larger jumps; in other words, relying on 
GBM as a financial model often cannot explain the market values of such OTM 
options satisfactorily, while a jump diffusion may be able to do so. 


“Square-Root Diffusion” on page 587 
The square-root diffusion, popularized in finance by Cox, Ingersoll, and Ross 
(1985), is used to model mean-reverting quantities like interest rates and volatil- 
ity; in addition to being mean-reverting, the process stays positive, which is gen- 
erally a desirable characteristic for those quantities. 


For further details on the simulation of the models presented in this chapter, refer 
also to Hilpisch (2015). In particular, that book contains a complete case study based 
on the jump diffusion model of Merton (1976). 


Random Number Generation 


Random number generation is a central task of Monte Carlo simulation.² Chapter 12
shows how to use Python and subpackages such as numpy.random to generate random 
numbers with different distributions. For the project at hand, standard normally dis- 
tributed random numbers are the most important ones. That is why it pays off to 
have the convenience function sn_random_numbers(), defined here, available for 
generating this particular type of random numbers: 


#
# DX Package
#
# Frame -- Random Number Generation
#
# sn_random_numbers.py
#
# Python for Finance, 2nd ed.
#
import numpy as np


def sn_random_numbers(shape, antithetic=True, moment_matching=True,
                      fixed_seed=False):
    ''' Returns an ndarray object of shape shape with (pseudo)random numbers
    that are standard normally distributed.

    Parameters

    shape: tuple (o, n, m)
        generation of array with shape (o, n, m)
    antithetic: Boolean
        generation of antithetic variates
    moment_matching: Boolean
        matching of first and second moments
    fixed_seed: Boolean
        flag to fix the seed

    Results

    ran: (o, n, m) array of (pseudo)random numbers
    '''
    if fixed_seed:
        np.random.seed(1000)
    if antithetic:
        ran = np.random.standard_normal(
            (shape[0], shape[1], shape[2] // 2))
        ran = np.concatenate((ran, -ran), axis=2)
    else:
        ran = np.random.standard_normal(shape)
    if moment_matching:
        ran = ran - np.mean(ran)
        ran = ran / np.std(ran)
    if shape[0] == 1:
        return ran[0]
    else:
        return ran


2 See Glasserman (2004), Chapter 2, on generating random numbers and random variables.


The variance reduction techniques used in this function, namely antithetic paths and 
moment matching, are also illustrated in Chapter 12.³ The application of the function
is straightforward: 


In [26]: from sn_random_numbers import *


In [27]: snrn = sn_random_numbers((2, 2, 2), antithetic=False,
                                  moment_matching=False, fixed_seed=True)
         snrn
Out[27]: array([[[-0.8044583 ,  0.32093155],
                 [-0.02548288,  0.64432383]],

                [[-0.30079667,  0.38947455],
                 [-0.1074373 , -0.47998308]]])


In [28]: round(snrn.mean(), 6)
Out[28]: -0.045429


In [29]: round(snrn.std(), 6)
Out[29]: 0.451876


In [30]: snrn = sn_random_numbers((2, 2, 2), antithetic=False,
                                  moment_matching=True, fixed_seed=True)
         snrn
Out[30]: array([[[-1.67972865,  0.81075283],
                 [ 0.04413963,  1.52641815]],

                [[-0.56512826,  0.96243813],
                 [-0.13722505, -0.96166678]]])


In [31]: round(snrn.mean(), 6)
Out[31]: -0.0


In [32]: round(snrn.std(), 6)
Out[32]: 1.0


3 Glasserman (2004) presents in Chapter 4 an overview and theoretical details of different variance reduction
techniques.


This function will prove a workhorse for the simulation classes to follow. 
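

To see what the two variance reduction techniques contribute individually, consider the following sketch. It is not the DX function itself: it assumes a local generator from np.random.default_rng() instead of the global seed used above, and the shape is chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1000)   # local generator (assumption of this sketch)

half = rng.standard_normal((3, 5000))
ran = np.concatenate((half, -half), axis=1)   # antithetic variates

# antithetic variates alone already center the sample
mean_antithetic = ran.mean()
std_antithetic = ran.std()

# moment matching additionally rescales to unit standard deviation
ran = (ran - ran.mean()) / ran.std()

print(mean_antithetic, std_antithetic)
print(ran.mean(), ran.std())
```

Antithetic variates correct the first moment by construction, while moment matching corrects both the first and the second moment of the sample.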


Generic Simulation Class 


Object-oriented modeling—as introduced in Chapter 6—allows inheritance of 
attributes and methods. This is what the following code makes use of when building 
the simulation classes: one starts with a generic simulation class containing those 
attributes and methods that all other simulation classes share and can then focus with 
the other classes on specific elements of the stochastic process to be simulated. 


Instantiating an object of any simulation class happens by providing three attributes 
only: 


name 
A str object as a name for the model simulation object 


mar_env 
An instance of the dx.market_environment class 


corr 
A flag (bool) indicating whether the object is correlated or not 


This again illustrates the role of a market environment: to provide in a single step all 
data and objects required for simulation and valuation. The methods of the generic 
class are: 


generate_time_grid() 
This method generates the time grid of relevant dates used for the simulation; 
this task is the same for every simulation class.


get_instrument_values()
Every simulation class has to return the ndarray object with the simulated 
instrument values (e.g., simulated stock prices, commodities prices, volatilities). 
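

The idea behind the time grid can be sketched independently of the class. The example below is an illustration under stated assumptions: it uses a weekly frequency ('W') and a hypothetical special date, and it follows the steps described for generate_time_grid(), namely creating regular dates with pandas, adding the start date, the end date, and any special dates, and then removing duplicates and sorting.

```python
import datetime as dt
import numpy as np
import pandas as pd

start = dt.datetime(2020, 1, 1)    # a Wednesday
end = dt.datetime(2020, 3, 31)     # a Tuesday
special_dates = [dt.datetime(2020, 2, 14)]   # hypothetical event date

# weekly dates (Sundays by default) between start and end
time_grid = list(pd.date_range(start=start, end=end,
                               freq='W').to_pydatetime())
if start not in time_grid:
    time_grid.insert(0, start)     # add start date if missing
if end not in time_grid:
    time_grid.append(end)          # add end date if missing
time_grid.extend(special_dates)    # add special dates
# delete duplicates and sort
time_grid = np.array(sorted(set(time_grid)))

print(len(time_grid))
print(time_grid[0], time_grid[-1])
```

Because neither the start nor the end date is a Sunday, both are added explicitly; the resulting grid has intervals of varying length.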


The code for the generic model simulation class follows. The methods make use of
other methods that the model-tailored classes will provide, like
self.generate_paths(). The details in this regard become clear when one has the full
picture of a specialized, nongeneric simulation class. First, the base class:

#
# DX Package
#
# Simulation Class -- Base Class
#
# simulation_class.py
#
import numpy as np
import pandas as pd


class simulation_class(object):
    ''' Providing base methods for simulation classes.

    Attributes

    name: str
        name of the object
    mar_env: instance of market_environment
        market environment data for simulation
    corr: bool
        True if correlated with other model object

    Methods

    generate_time_grid:
        returns time grid for simulation
    get_instrument_values:
        returns the current instrument values (array)
    '''

    def __init__(self, name, mar_env, corr):
        self.name = name
        self.pricing_date = mar_env.pricing_date
        self.initial_value = mar_env.get_constant('initial_value')
        self.volatility = mar_env.get_constant('volatility')
        self.final_date = mar_env.get_constant('final_date')
        self.currency = mar_env.get_constant('currency')
        self.frequency = mar_env.get_constant('frequency')
        self.paths = mar_env.get_constant('paths')
        self.discount_curve = mar_env.get_curve('discount_curve')
        try:
            # if time_grid in mar_env take that object
            # (for portfolio valuation)
            self.time_grid = mar_env.get_list('time_grid')
        except:
            self.time_grid = None
        try:
            # if there are special dates, then add these
            self.special_dates = mar_env.get_list('special_dates')
        except:
            self.special_dates = []
        self.instrument_values = None
        self.correlated = corr
        if corr is True:
            # only needed in a portfolio context when
            # risk factors are correlated
            self.cholesky_matrix = mar_env.get_list('cholesky_matrix')
            self.rn_set = mar_env.get_list('rn_set')[self.name]
            self.random_numbers = mar_env.get_list('random_numbers')

    def generate_time_grid(self):
        start = self.pricing_date
        end = self.final_date
        # pandas date_range function
        # freq = e.g. 'B' for Business Day,
        # 'W' for Weekly, 'M' for Monthly
        time_grid = pd.date_range(start=start, end=end,
                                  freq=self.frequency).to_pydatetime()
        time_grid = list(time_grid)
        # enhance time_grid by start, end, and special_dates
        if start not in time_grid:
            time_grid.insert(0, start)
            # insert start date if not in list
        if end not in time_grid:
            time_grid.append(end)
            # insert end date if not in list
        if len(self.special_dates) > 0:
            # add all special dates
            time_grid.extend(self.special_dates)
            # delete duplicates
            time_grid = list(set(time_grid))
            # sort list
            time_grid.sort()
        self.time_grid = np.array(time_grid)

    def get_instrument_values(self, fixed_seed=True):
        if self.instrument_values is None:
            # only initiate simulation if there are no instrument values
            self.generate_paths(fixed_seed=fixed_seed, day_count=365.)
        elif fixed_seed is False:
            # also initiate resimulation when fixed_seed is False
            self.generate_paths(fixed_seed=fixed_seed, day_count=365.)
        return self.instrument_values


Parsing of the market environment is embedded in the special method __init__(),
which is called during instantiation. To keep the code concise, there are no sanity
checks implemented. For example, the following line of code is considered a
“success,” no matter if the content is indeed an instance of a discounting class or not.
Therefore, one has to be rather careful when compiling and passing
dx.market_environment objects to any simulation class:


    self.discount_curve = mar_env.get_curve('discount_curve')


Table 18-1 shows all components a dx.market_environment object must contain for
the generic and therefore for all other simulation classes.


Table 18-1. Elements of the market environment for all simulation classes


Element          Type      Mandatory  Description

initial_value    Constant  Yes        Initial value of process at pricing_date
volatility       Constant  Yes        Volatility coefficient of process
final_date       Constant  Yes        Simulation horizon
currency         Constant  Yes        Currency of the financial entity
frequency        Constant  Yes        Date frequency, as pandas freq parameter
paths            Constant  Yes        Number of paths to be simulated
discount_curve   Curve     Yes        Instance of dx.constant_short_rate
time_grid        List      No         Time grid of relevant dates (in portfolio context)
random_numbers   List      No         Random number np.ndarray object (for correlated objects)
cholesky_matrix  List      No         Cholesky matrix (for correlated objects)
rn_set           List      No         dict object with pointer to relevant random number set
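

Since the simulation classes do not check their inputs, a small helper can catch incomplete market environments before instantiation. The following is only a sketch: the function missing_elements() and the two tuples are hypothetical and not part of the DX package. The mandatory keys are taken from Table 18-1, and the function works on plain dict objects such as the constants and curves attributes of a market environment.

```python
MANDATORY_CONSTANTS = ('initial_value', 'volatility', 'final_date',
                       'currency', 'frequency', 'paths')
MANDATORY_CURVES = ('discount_curve',)


def missing_elements(constants, curves):
    ''' Return sorted list of mandatory keys missing from the two dicts. '''
    missing = [key for key in MANDATORY_CONSTANTS if key not in constants]
    missing += [key for key in MANDATORY_CURVES if key not in curves]
    return sorted(missing)


# an incomplete parameterization for illustration
constants = {'initial_value': 36., 'volatility': 0.2, 'paths': 10000}
curves = {}

missing = missing_elements(constants, curves)
print(missing)
```

For a market_environment instance me, a call like missing_elements(me.constants, me.curves) would report what is still to be added. The helper checks only for presence, not for sensible types or values.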


Everything that has to do with the correlation of model simulation objects is 
explained in subsequent chapters. In this chapter, the focus is on the simulation of 
single, uncorrelated processes. Similarly, the option to pass a time_grid is only rele- 
vant in a portfolio context, something also explained later. 


Geometric Brownian Motion 


Geometric Brownian motion is a stochastic process, as described in Equation 18-1
(see also Equation 12-2 in Chapter 12, in particular for the meaning of the
parameters and variables). The drift of the process is already set equal to the riskless,
constant short rate r, implying that one operates under the equivalent martingale
measure (see Chapter 17).


Equation 18-1. Stochastic differential equation of geometric Brownian motion

$dS_t = r S_t \, dt + \sigma S_t \, dZ_t$

Equation 18-2 presents an Euler discretization of the stochastic differential equation
for simulation purposes (see also Equation 12-3 in Chapter 12 for further details).


The general framework is a discrete time market model, such as the general market
model M from Chapter 17, with a finite set of relevant dates $0 < t_1 < t_2 < \ldots < T$.


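Equation 18-2. Difference equation to simulate the geometric Brownian motion

$S_{t_{m+1}} = S_{t_m} \exp\left(\left(r - \frac{\sigma^2}{2}\right)(t_{m+1} - t_m) + \sigma \sqrt{t_{m+1} - t_m} \, z_t\right)$

$0 \le t_m < t_{m+1} \le T$


The Simulation Class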
Following is the specialized class for the GBM model: 


#
# DX Package
#
# Simulation Class -- Geometric Brownian Motion
#
# geometric_brownian_motion.py
#
# Python for Finance, 2nd ed.
# (c) Dr. Yves J. Hilpisch
#
import numpy as np

from sn_random_numbers import sn_random_numbers
from simulation_class import simulation_class


class geometric_brownian_motion(simulation_class):
    ''' Class to generate simulated paths based on
    the Black-Scholes-Merton geometric Brownian motion model.

    Attributes

    name: string
        name of the object
    mar_env: instance of market_environment
        market environment data for simulation
    corr: Boolean
        True if correlated with other model simulation object

    Methods

    update:
        updates parameters
    generate_paths:
        returns Monte Carlo paths given the market environment
    '''

    def __init__(self, name, mar_env, corr=False):
        super(geometric_brownian_motion, self).__init__(name, mar_env, corr)

    def update(self, initial_value=None, volatility=None, final_date=None):
        if initial_value is not None:
            self.initial_value = initial_value
        if volatility is not None:
            self.volatility = volatility
        if final_date is not None:
            self.final_date = final_date
        self.instrument_values = None

    def generate_paths(self, fixed_seed=False, day_count=365.):
        if self.time_grid is None:
            # method from generic simulation class
            self.generate_time_grid()
        # number of dates for time grid
        M = len(self.time_grid)
        # number of paths
        I = self.paths
        # ndarray initialization for path simulation
        paths = np.zeros((M, I))
        # initialize first date with initial_value
        paths[0] = self.initial_value
        if not self.correlated:
            # if not correlated, generate random numbers
            rand = sn_random_numbers((1, M, I),
                                     fixed_seed=fixed_seed)
        else:
            # if correlated, use random number object as provided
            # in market environment
            rand = self.random_numbers
        short_rate = self.discount_curve.short_rate
        # get short rate for drift of process
        for t in range(1, len(self.time_grid)):
            # select the right time slice from the relevant
            # random number set
            if not self.correlated:
                ran = rand[t]
            else:
                ran = np.dot(self.cholesky_matrix, rand[:, t, :])
                ran = ran[self.rn_set]
            dt = (self.time_grid[t] - self.time_grid[t - 1]).days / day_count
            # difference between two dates as year fraction
            paths[t] = paths[t - 1] * np.exp((short_rate - 0.5 *
                                              self.volatility ** 2) * dt +
                                             self.volatility * np.sqrt(dt) * ran)
            # generate simulated values for the respective date
        self.instrument_values = paths


In this particular case, the dx.market_environment object has to contain only the 
data and objects shown in Table 18-1—i.e., the minimum set of components. 


The method update() does what its name suggests: it allows the updating of selected 
important parameters of the model. The method generate_paths() is, of course, a 
bit more involved. However, it has a number of inline comments that should make 
clear the most important aspects. Some complexity is brought into this method by, in 
principle, allowing for the correlation between different model simulation objects— 
the purpose of which will become clearer later, especially in Chapter 20. 
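

Stripped of the correlation logic and the market environment, the core of generate_paths() reduces to a short loop. The following sketch is a simplified stand-in rather than the DX method: the function simulate_gbm() is hypothetical, it works on year fractions instead of datetime objects, and it draws random numbers from a local NumPy generator without variance reduction.

```python
import numpy as np


def simulate_gbm(S0, r, sigma, year_fractions, I, seed=1000):
    ''' Simulate I GBM paths on a grid of year fractions starting at 0. '''
    rng = np.random.default_rng(seed)
    M = len(year_fractions)
    paths = np.zeros((M, I))
    paths[0] = S0   # initial value for all paths
    for t in range(1, M):
        # length of the time interval as year fraction
        dt = year_fractions[t] - year_fractions[t - 1]
        z = rng.standard_normal(I)
        # exact log-normal step with risk-neutral drift r
        paths[t] = paths[t - 1] * np.exp((r - 0.5 * sigma ** 2) * dt
                                         + sigma * np.sqrt(dt) * z)
    return paths


grid = np.array([0.0, 0.1, 0.25, 0.5, 1.0])   # heterogeneous intervals
paths = simulate_gbm(36., 0.06, 0.2, grid, 50000)

# under the martingale measure, the discounted mean should be close to S0
discounted_mean = np.exp(-0.06 * grid[-1]) * paths[-1].mean()
print(discounted_mean)
```

The last two lines connect the simulation back to the risk-neutral valuation approach: with the drift set to the short rate, the discounted expected terminal value is approximately the initial value, up to Monte Carlo error.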


A Use Case 


The following interactive IPython session illustrates the use of the GBM simulation
class. First, one has to generate a dx.market_environment object with all the
mandatory elements:


In [33]: from dx_frame import *

In [34]: me_gbm = market_environment('me_gbm', dt.datetime(2020, 1, 1))

In [35]: me_gbm.add_constant('initial_value', 36.)
         me_gbm.add_constant('volatility', 0.2)
         me_gbm.add_constant('final_date', dt.datetime(2020, 12, 31))
         me_gbm.add_constant('currency', 'EUR')
         me_gbm.add_constant('frequency', 'M')  (1)
         me_gbm.add_constant('paths', 10000)

In [36]: csr = constant_short_rate('csr', 0.06)

In [37]: me_gbm.add_curve('discount_curve', csr)


(1) Monthly frequency with month end as default.


Second, one instantiates a model simulation object to work with: 


In [38]: from geometric_brownian_motion import geometric_brownian_motion

In [39]: gbm = geometric_brownian_motion('gbm', me_gbm)  (1)

In [40]: gbm.generate_time_grid()  (2)

In [41]: gbm.time_grid  (3)
Out[41]: array([datetime.datetime(2020, 1, 1, 0, 0),
                datetime.datetime(2020, 1, 31, 0, 0),
                datetime.datetime(2020, 2, 29, 0, 0),
                datetime.datetime(2020, 3, 31, 0, 0),
                datetime.datetime(2020, 4, 30, 0, 0),
                datetime.datetime(2020, 5, 31, 0, 0),
                datetime.datetime(2020, 6, 30, 0, 0),
                datetime.datetime(2020, 7, 31, 0, 0),
                datetime.datetime(2020, 8, 31, 0, 0),
                datetime.datetime(2020, 9, 30, 0, 0),
                datetime.datetime(2020, 10, 31, 0, 0),
                datetime.datetime(2020, 11, 30, 0, 0),
                datetime.datetime(2020, 12, 31, 0, 0)], dtype=object)

In [42]: %time paths_1 = gbm.get_instrument_values()  (4)
         CPU times: user 21.3 ms, sys: 6.74 ms, total: 28.1 ms
         Wall time: 40.3 ms

In [43]: paths_1.round(3)  (4)
Out[43]: array([[36.   , 36.   , 36.   , ..., 36.   , 36.   , 36.   ],
                [37.403, 38.12 , 34.4  , ..., 36.252, 35.084, 39.668],
                [39.562, 42.335, 32.405, ..., 34.836, 33.637, 37.655],
                ...,
                [40.534, 33.506, 23.497, ..., 37.851, 30.122, 30.446],
                [42.527, 36.995, 21.885, ..., 36.014, 30.907, 30.712],
                [43.811, 37.876, 24.1  , ..., 36.263, 28.138, 29.038]])

In [44]: gbm.update(volatility=0.5)  (5)

In [45]: %time paths_2 = gbm.get_instrument_values()  (5)
         CPU times: user 27.8 ms, sys: 3.91 ms, total: 31.7 ms
         Wall time: 19.8 ms


(1) Instantiates the simulation object.
(2) Generates the time grid ...
(3) ... and shows it; note that the initial date is added.
(4) Simulates the paths given the parameterization.
(5) Updates the volatility parameter and repeats the simulation.


Figure 18-1 shows 10 simulated paths for the two different parameterizations. The 
effect of increasing the volatility parameter value is easy to see: 


In [46]: plt.figure(figsize=(10, 6)) 
         p1 = plt.plot(gbm.time_grid, paths_1[:, :10], 'b') 
         p2 = plt.plot(gbm.time_grid, paths_2[:, :10], 'r-.') 
         l1 = plt.legend([p1[0], p2[0]], 
                         ['low volatility', 'high volatility'], loc=2) 
         plt.gca().add_artist(l1) 
         plt.xticks(rotation=30); 

Figure 18-1. Simulated paths from GBM simulation class 


Vectorization for Simulation 


As argued and shown already in Chapter 12, vectorization 
approaches using NumPy and pandas are well suited to writing 
concise and performant simulation code. 


Jump Diffusion 


Equipped with the background knowledge from the dx.geometric_brownian_motion 
class, it is now straightforward to implement a class for the jump diffusion model 
described by Merton (1976). The stochastic differential equation for the jump 
diffusion model is shown in Equation 18-3 (see also Equation 12-8 in Chapter 12, in 
particular for the meaning of the parameters and variables). 


Equation 18-3. Stochastic differential equation for Merton jump diffusion model 
$$dS_t = (r - r_J) S_t \, dt + \sigma S_t \, dZ_t + J_t S_t \, dN_t$$ 


An Euler discretization for simulation purposes is presented in Equation 18-4 (see 
also Equation 12-9 in Chapter 12 and the more detailed explanations given there). 


Equation 18-4. Euler discretization for Merton jump diffusion model 

$$S_{t_{m+1}} = S_{t_m} \left( \exp\left( \left( r - r_J - \frac{\sigma^2}{2} \right) (t_{m+1} - t_m) + \sigma \sqrt{t_{m+1} - t_m} \, z_t^1 \right) + \left( e^{\mu_J + \delta z_t^2} - 1 \right) y_t \right)$$ 


The Simulation Class 


The Python code for the dx.jump_diffusion simulation class follows. This class 
should by now contain no surprises. Of course, the model is different, but the design 
and the methods are essentially the same: 


# 
# DX Package 
# 
# Simulation Class -- Jump Diffusion 
# 
# jump_diffusion.py 
# 
# Python for Finance, 2nd ed. 
# (c) Dr. Yves J. Hilpisch 
# 
import numpy as np 

from sn_random_numbers import sn_random_numbers 
from simulation_class import simulation_class 


class jump_diffusion(simulation_class): 
    ''' Class to generate simulated paths based on 
    the Merton (1976) jump diffusion model. 

    Attributes 
    ========== 
    name: str 
        name of the object 
    mar_env: instance of market_environment 
        market environment data for simulation 
    corr: bool 
        True if correlated with other model object 

    Methods 
    ======= 
    update: 
        updates parameters 
    generate_paths: 
        returns Monte Carlo paths given the market environment 
    ''' 

    def __init__(self, name, mar_env, corr=False): 
        super(jump_diffusion, self).__init__(name, mar_env, corr) 
        # additional parameters needed 
        self.lamb = mar_env.get_constant('lambda') 
        self.mu = mar_env.get_constant('mu') 
        self.delt = mar_env.get_constant('delta') 

    def update(self, initial_value=None, volatility=None, lamb=None, 
               mu=None, delta=None, final_date=None): 
        if initial_value is not None: 
            self.initial_value = initial_value 
        if volatility is not None: 
            self.volatility = volatility 
        if lamb is not None: 
            self.lamb = lamb 
        if mu is not None: 
            self.mu = mu 
        if delta is not None: 
            self.delt = delta 
        if final_date is not None: 
            self.final_date = final_date 
        self.instrument_values = None 

    def generate_paths(self, fixed_seed=False, day_count=365.): 
        if self.time_grid is None: 
            # method from generic simulation class 
            self.generate_time_grid() 
        # number of dates for time grid 
        M = len(self.time_grid) 
        # number of paths 
        I = self.paths 
        # ndarray initialization for path simulation 
        paths = np.zeros((M, I)) 
        # initialize first date with initial_value 
        paths[0] = self.initial_value 
        if self.correlated is False: 
            # if not correlated, generate random numbers 
            sn1 = sn_random_numbers((1, M, I), 
                                    fixed_seed=fixed_seed) 
        else: 
            # if correlated, use random number object as provided 
            # in market environment 
            sn1 = self.random_numbers 

        # standard normally distributed pseudo-random numbers 
        # for the jump component 
        sn2 = sn_random_numbers((1, M, I), 
                                fixed_seed=fixed_seed) 

        rj = self.lamb * (np.exp(self.mu + 0.5 * self.delt ** 2) - 1) 

        short_rate = self.discount_curve.short_rate 
        for t in range(1, len(self.time_grid)): 
            # select the right time slice from the relevant 
            # random number set 
            if self.correlated is False: 
                ran = sn1[t] 
            else: 
                # only with correlation in portfolio context 
                ran = np.dot(self.cholesky_matrix, sn1[:, t, :]) 
                ran = ran[self.rn_set] 
            dt = (self.time_grid[t] - self.time_grid[t - 1]).days / day_count 
            # difference between two dates as year fraction 
            poi = np.random.poisson(self.lamb * dt, I) 
            # Poisson-distributed pseudo-random numbers for jump component 
            paths[t] = paths[t - 1] * ( 
                np.exp((short_rate - rj - 
                        0.5 * self.volatility ** 2) * dt + 
                       self.volatility * np.sqrt(dt) * ran) + 
                (np.exp(self.mu + self.delt * sn2[t]) - 1) * poi) 
        self.instrument_values = paths 


Of course, since this is a different model, it needs a different set of elements in the 
dx.market_environment object. In addition to those for the generic simulation class 
(see Table 18-1), three parameters are required, as outlined in Table 18-2: 
namely, the parameters of the log-normal jump component, lambda, mu, and delta. 


Table 18-2. Specific elements of the market environment for dx.jump_diffusion class 

Element   Type       Mandatory   Description 
lambda    Constant   Yes         Jump intensity (probability p.a.) 
mu        Constant   Yes         Expected jump size 
delta     Constant   Yes         Standard deviation of jump size 


For the generation of the paths, this class needs further random numbers because of 
the jump component. Inline comments in the method generate_paths() highlight 
the two spots where these additional random numbers are generated. For the 
generation of Poisson-distributed random numbers, see also Chapter 12. 
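The role of the two additional sets of random numbers can be seen in isolation in the following standalone sketch, which assumes plain NumPy and a grid of year fractions instead of the DX classes. The function names jump_drift_correction and simulate_jump_diffusion are hypothetical and chosen for illustration only; the correlated case is omitted.

```python
import numpy as np


def jump_drift_correction(lamb, mu, delta):
    """Drift correction r_J that compensates the expected jump effect."""
    return lamb * (np.exp(mu + 0.5 * delta ** 2) - 1)


def simulate_jump_diffusion(initial_value, volatility, short_rate,
                            lamb, mu, delta, year_fractions,
                            n_paths, seed=None):
    """Euler-type simulation of a Merton jump diffusion (sketch)."""
    rng = np.random.default_rng(seed)
    year_fractions = np.asarray(year_fractions, dtype=float)
    M = len(year_fractions)
    paths = np.zeros((M, n_paths))
    paths[0] = initial_value
    rj = jump_drift_correction(lamb, mu, delta)
    for t in range(1, M):
        dt = year_fractions[t] - year_fractions[t - 1]
        # first set: standard normal numbers for the diffusion part
        z1 = rng.standard_normal(n_paths)
        # second set: standard normal numbers for the jump size
        z2 = rng.standard_normal(n_paths)
        # third set: Poisson-distributed number of jumps per interval
        poi = rng.poisson(lamb * dt, n_paths)
        paths[t] = paths[t - 1] * (
            np.exp((short_rate - rj - 0.5 * volatility ** 2) * dt
                   + volatility * np.sqrt(dt) * z1)
            + (np.exp(mu + delta * z2) - 1) * poi)
    return paths


# hypothetical parameters: one year, 12 equal steps
grid = np.linspace(0.0, 1.0, 13)
jd_paths = simulate_jump_diffusion(36.0, 0.2, 0.06, lamb=0.3, mu=-0.75,
                                   delta=0.1, year_fractions=grid,
                                   n_paths=10000, seed=1000)
```

With a negative mu, the correction returned by jump_drift_correction() is negative, so the diffusion drift is raised to offset the downward jumps on average.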


A Use Case 


The following interactive session illustrates how to use the simulation class 
dx.jump_diffusion. The dx.market_environment object defined for the GBM object 
is used as a basis: 


In [47]: me_jd = market_environment('me_jd', dt.datetime(2020, 1, 1)) 

In [48]: me_jd.add_constant('lambda', 0.3)  (1) 
         me_jd.add_constant('mu', -0.75)  (1) 
         me_jd.add_constant('delta', 0.1)  (1) 

In [49]: me_jd.add_environment(me_gbm)  (2) 

In [50]: from jump_diffusion import jump_diffusion 

In [51]: jd = jump_diffusion('jd', me_jd) 

In [52]: %time paths_3 = jd.get_instrument_values()  (3) 
         CPU times: user 28.6 ms, sys: 4.37 ms, total: 33 ms 
         Wall time: 49.4 ms 

In [53]: jd.update(lamb=0.9)  (4) 

In [54]: %time paths_4 = jd.get_instrument_values()  (5) 
         CPU times: user 29.7 ms, sys: 3.58 ms, total: 33.3 ms 
         Wall time: 66.7 ms 

(1) The three additional parameters for the dx.jump_diffusion object. These are 
    specific to the simulation class. 

(2) Adds a complete environment to the existing one. 

(3) Simulates the paths with the base parameters. 

(4) Increases the jump intensity parameter. 

(5) Simulates the paths with the updated parameter. 


Figure 18-2 compares a couple of simulated paths from the two sets with low and 
high intensity (jump probability), respectively. It is easy to spot several jumps for the 
low-intensity case and the multiple jumps for the high-intensity case in the figure: 


In [55]: plt.figure(figsize=(10, 6)) 
         p1 = plt.plot(gbm.time_grid, paths_3[:, :10], 'b') 
         p2 = plt.plot(gbm.time_grid, paths_4[:, :10], 'r-.') 
         l1 = plt.legend([p1[0], p2[0]], 
                         ['low intensity', 'high intensity'], loc=3) 
         plt.gca().add_artist(l1) 
         plt.xticks(rotation=30); 

Figure 18-2. Simulated paths from jump diffusion simulation class 


Square-Root Diffusion 


The third stochastic process to be simulated is the square-root diffusion as used, for 
example, by Cox, Ingersoll, and Ross (1985) to model stochastic short rates. Equation 
18-5 shows the stochastic differential equation of the process (see also Equation 12-4 
in Chapter 12 for further details). 


Equation 18-5. Stochastic differential equation of square-root diffusion 
$$dx_t = \kappa (\theta - x_t) \, dt + \sigma \sqrt{x_t} \, dZ_t$$ 


The code uses the discretization scheme as presented in Equation 18-6 (see also 
Equation 12-5 in Chapter 12, as well as Equation 12-6 for an alternative, exact 
scheme). 


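Equation 18-6. Euler discretization for square-root diffusion (full truncation 
scheme) 

$$\tilde{x}_{t_{m+1}} = \tilde{x}_{t_m} + \kappa \left( \theta - \tilde{x}_{t_m}^{+} \right) (t_{m+1} - t_m) + \sigma \sqrt{\tilde{x}_{t_m}^{+}} \sqrt{t_{m+1} - t_m} \, z_t$$ 

$$x_{t_{m+1}} = \tilde{x}_{t_{m+1}}^{+}$$ 


The Simulation Class 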


Following is the Python code for the dx.square_root_diffusion simulation class, 
which is the third and final one. Apart from, of course, a different model and 
discretization scheme, the class does not contain anything new compared to the other 
two specialized classes: 


# 
# DX Package 
# 
# Simulation Class -- Square-Root Diffusion 
# 
# square_root_diffusion.py 
# 
# Python for Finance, 2nd ed. 
# (c) Dr. Yves J. Hilpisch 
# 
import numpy as np 

from sn_random_numbers import sn_random_numbers 
from simulation_class import simulation_class 


class square_root_diffusion(simulation_class): 
    ''' Class to generate simulated paths based on 
    the Cox-Ingersoll-Ross (1985) square-root diffusion model. 

    Attributes 
    ========== 
    name : string 
        name of the object 
    mar_env : instance of market_environment 
        market environment data for simulation 
    corr : Boolean 
        True if correlated with other model object 

    Methods 
    ======= 
    update : 
        updates parameters 
    generate_paths : 
        returns Monte Carlo paths given the market environment 
    ''' 

    def __init__(self, name, mar_env, corr=False): 
        super(square_root_diffusion, self).__init__(name, mar_env, corr) 
        # additional parameters needed 
        self.kappa = mar_env.get_constant('kappa') 
        self.theta = mar_env.get_constant('theta') 

    def update(self, initial_value=None, volatility=None, kappa=None, 
               theta=None, final_date=None): 
        if initial_value is not None: 
            self.initial_value = initial_value 
        if volatility is not None: 
            self.volatility = volatility 
        if kappa is not None: 
            self.kappa = kappa 
        if theta is not None: 
            self.theta = theta 
        if final_date is not None: 
            self.final_date = final_date 
        self.instrument_values = None 

    def generate_paths(self, fixed_seed=True, day_count=365.): 
        if self.time_grid is None: 
            self.generate_time_grid() 
        M = len(self.time_grid) 
        I = self.paths 
        paths = np.zeros((M, I)) 
        paths_ = np.zeros_like(paths) 
        paths[0] = self.initial_value 
        paths_[0] = self.initial_value 
        if self.correlated is False: 
            rand = sn_random_numbers((1, M, I), 
                                     fixed_seed=fixed_seed) 
        else: 
            rand = self.random_numbers 

        for t in range(1, len(self.time_grid)): 
            dt = (self.time_grid[t] - self.time_grid[t - 1]).days / day_count 
            if self.correlated is False: 
                ran = rand[t] 
            else: 
                ran = np.dot(self.cholesky_matrix, rand[:, t, :]) 
                ran = ran[self.rn_set] 

            # full truncation Euler discretization 
            paths_[t] = (paths_[t - 1] + self.kappa * 
                         (self.theta - np.maximum(0, paths_[t - 1, :])) * dt + 
                         np.sqrt(np.maximum(0, paths_[t - 1, :])) * 
                         self.volatility * np.sqrt(dt) * ran) 
            paths[t] = np.maximum(0, paths_[t]) 
        self.instrument_values = paths 


Table 18-3 lists the two elements of the market environment that are specific to this 
class. 


Table 18-3. Specific elements of the market environment for dx.square_root_diffusion class 

Element   Type       Mandatory   Description 
kappa     Constant   Yes         Mean reversion factor 
theta     Constant   Yes         Long-term mean of process 


A Use Case 


A rather brief example illustrates the use of the simulation class. As usual, one needs 
a market environment, for example, to model a volatility (index) process: 


In [56]: me_srd = market_environment('me_srd', dt.datetime(2020, 1, 1)) 

In [57]: me_srd.add_constant('initial_value', .25) 
         me_srd.add_constant('volatility', 0.05) 
         me_srd.add_constant('final_date', dt.datetime(2020, 12, 31)) 
         me_srd.add_constant('currency', 'EUR') 
         me_srd.add_constant('frequency', 'W') 
         me_srd.add_constant('paths', 10000) 

In [58]: me_srd.add_constant('kappa', 4.0)  (1) 
         me_srd.add_constant('theta', 0.2)  (1) 

In [59]: me_srd.add_curve('discount_curve', constant_short_rate('r', 0.0))  (2) 

In [60]: from square_root_diffusion import square_root_diffusion 

In [61]: srd = square_root_diffusion('srd', me_srd)  (3) 

In [62]: srd_paths = srd.get_instrument_values()[:, :10]  (4) 

(1) Additional parameters for the dx.square_root_diffusion object. 

(2) The discount_curve object is required by default but not needed for the 
    simulation. 

(3) Instantiates the object ... 

(4) ... simulates the paths, and selects 10. 


Figure 18-3 illustrates the mean-reverting characteristic by showing how the 
simulated paths on average revert to the long-term mean theta (dashed line), which is 
assumed to be 0.2: 


In [63]: plt.figure(figsize=(10, 6)) 
         plt.plot(srd.time_grid, srd.get_instrument_values()[:, :10]) 
         plt.axhline(me_srd.get_constant('theta'), color='r', 
                     ls='--', lw=2.0) 
         plt.xticks(rotation=30); 

Figure 18-3. Simulated paths from square-root diffusion simulation class (dashed line 
= long-term mean theta) 
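The mean reversion can also be checked numerically. The following is a minimal standalone sketch of the full truncation scheme, assuming plain NumPy and an equidistant weekly grid of year fractions; the function name simulate_srd_full_truncation is hypothetical and not part of the DX package.

```python
import numpy as np


def simulate_srd_full_truncation(x0, kappa, theta, sigma,
                                 year_fractions, n_paths, seed=None):
    """Full truncation Euler scheme for a square-root diffusion (sketch)."""
    rng = np.random.default_rng(seed)
    year_fractions = np.asarray(year_fractions, dtype=float)
    M = len(year_fractions)
    paths = np.zeros((M, n_paths))   # truncated (reported) values
    paths_ = np.zeros_like(paths)    # untruncated auxiliary values
    paths[0] = x0
    paths_[0] = x0
    for t in range(1, M):
        dt = year_fractions[t] - year_fractions[t - 1]
        z = rng.standard_normal(n_paths)
        # positive part of the auxiliary process
        x_plus = np.maximum(0, paths_[t - 1])
        paths_[t] = (paths_[t - 1] + kappa * (theta - x_plus) * dt
                     + sigma * np.sqrt(x_plus) * np.sqrt(dt) * z)
        # reported value is the positive part
        paths[t] = np.maximum(0, paths_[t])
    return paths


# parameters as in the use case; weekly steps over one year (assumed grid)
grid = np.linspace(0.0, 1.0, 53)
srd_sketch = simulate_srd_full_truncation(0.25, 4.0, 0.2, 0.05,
                                          grid, 10000, seed=1000)
mean_at_end = srd_sketch[-1].mean()
```

With a mean reversion factor of 4.0, the cross-sectional mean after one year should lie close to the long-term mean of 0.2, although the starting value is 0.25.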


Conclusion 


This chapter develops all the tools and classes needed for the simulation of the three 
stochastic processes of interest: geometric Brownian motion, jump diffusions, and 
square-root diffusions. The chapter presents a function to conveniently generate 
standard normally distributed random numbers. It then proceeds by introducing a 
generic model simulation class. Based on this foundation, the chapter introduces 
three specialized simulation classes and presents use cases for these classes. 


To simplify future imports one can again use a wrapper module, this one called 
dx_simulation.py: 


# 
# DX Package 
# 
# Simulation Functions & Classes 
# 
# dx_simulation.py 
# 
# Python for Finance, 2nd ed. 
# (c) Dr. Yves J. Hilpisch 
# 
import numpy as np 
import pandas as pd 

from dx_frame import * 
from sn_random_numbers import sn_random_numbers 
from simulation_class import simulation_class 
from geometric_brownian_motion import geometric_brownian_motion 
from jump_diffusion import jump_diffusion 
from square_root_diffusion import square_root_diffusion 

As with the first wrapper module, dx_frame.py, the benefit is that a single import 
statement makes available all simulation components: 


from dx_simulation import * 


Since dx_simulation.py also imports everything from dx_frame.py, this single import 
in fact exposes all functionality developed so far. The same holds true for the 
enhanced __init__.py file in the dx folder: 


# 
# DX Package 
# packaging file 
# __init__.py 
# 
import numpy as np 
import pandas as pd 
import datetime as dt 

# frame 
from get_year_deltas import get_year_deltas 
from constant_short_rate import constant_short_rate 
from market_environment import market_environment 

# simulation 
from sn_random_numbers import sn_random_numbers 
from simulation_class import simulation_class 
from geometric_brownian_motion import geometric_brownian_motion 
from jump_diffusion import jump_diffusion 
from square_root_diffusion import square_root_diffusion 


Further Resources 


Useful references in book form for the topics covered in this chapter are: 


• Glasserman, Paul (2004). Monte Carlo Methods in Financial Engineering. New 
  York: Springer. 

• Hilpisch, Yves (2015). Derivatives Analytics with Python. Chichester, England: 
  Wiley Finance. 


Original papers cited in this chapter are: 


• Black, Fischer, and Myron Scholes (1973). “The Pricing of Options and 
  Corporate Liabilities.” Journal of Political Economy, Vol. 81, No. 3, pp. 638-659. 

• Cox, John, Jonathan Ingersoll, and Stephen Ross (1985). “A Theory of the Term 
  Structure of Interest Rates.” Econometrica, Vol. 53, No. 2, pp. 385-407. 

• Merton, Robert (1973). “Theory of Rational Option Pricing.” Bell Journal of 
  Economics and Management Science, Vol. 4, pp. 141-183. 

• Merton, Robert (1976). “Option Pricing When the Underlying Stock Returns Are 
  Discontinuous.” Journal of Financial Economics, Vol. 3, No. 3, pp. 125-144. 



CHAPTER 19 
Derivatives Valuation 


Derivatives are a huge, complex issue. 


—Judd Gregg 


Options and derivatives valuation has long been the domain of the so-called rocket 
scientists on Wall Street (i.e., people with a PhD in physics or a similarly demanding 
discipline when it comes to the mathematics involved). However, the application of 
the models by the means of numerical methods like Monte Carlo simulation is 
generally a little less involved than the theoretical models themselves. 


This is particularly true for the valuation of options and derivatives with European 
exercise (i.e., where exercise is only possible at a certain predetermined date). It is a 
bit less true for options and derivatives with American exercise, where exercise is 
allowed at any point over a prespecified period of time. This chapter introduces and 
uses the Least-Squares Monte Carlo (LSM) algorithm, which has become a 
benchmark algorithm when it comes to American options valuation based on Monte 
Carlo simulation. 


The current chapter is similar in structure to Chapter 18 in that it first introduces a 
generic valuation class and then provides two specialized valuation classes, one for 
European exercise and another for American exercise. The generic valuation class 
contains methods to numerically estimate the most important Greeks of an option: 
the delta and the vega. Therefore, the valuation classes are important not only for 
valuation purposes, but also for risk management purposes. 


The chapter is structured as follows: 


“Generic Valuation Class” 
This section introduces the generic valuation class from which the specific ones 
inherit. 


“European Exercise” 
This section is about the valuation class for options and derivatives with 
European exercise. 


“American Exercise” 
This section covers the valuation class for options and derivatives with American 
exercise. 


Generic Valuation Class 


As with the generic simulation class, one instantiates an object of the valuation class 
by providing only a few inputs (in this case, four): 


name 
A str object, as a name for the model simulation object 


underlying 
An instance of a simulation class representing the underlying 


mar_env 
An instance of the dx.market_environment class 


payoff_func 
A Python str object containing the payoff function for the option/derivative 


The generic class has three methods: 


update() 
Updates selected valuation parameters (attributes) 


delta() 
Calculates a numerical value for the delta of an option/derivative 


vega() 
Calculates the vega of an option/derivative 


Equipped with the background knowledge from the previous chapters about the DX 
package, the generic valuation class as presented here should be almost 
self-explanatory; where appropriate, inline comments are also provided. Again, the 
class is presented in its entirety first, then discussed in more detail: 


# 
# DX Package 
# 
# Valuation -- Base Class 
# 
# valuation_class.py 
# 
# Python for Finance, 2nd ed. 
# (c) Dr. Yves J. Hilpisch 
# 


class valuation_class(object): 
    ''' Basic class for single-factor valuation. 

    Attributes 
    ========== 
    name: str 
        name of the object 
    underlying: instance of simulation class 
        object modeling the single risk factor 
    mar_env: instance of market_environment 
        market environment data for valuation 
    payoff_func: str 
        derivatives payoff in Python syntax 
        Example: 'np.maximum(maturity_value - 100, 0)' 
        where maturity_value is the NumPy vector with 
        respective values of the underlying 
        Example: 'np.maximum(instrument_values - 100, 0)' 
        where instrument_values is the NumPy matrix with 
        values of the underlying over the whole time/path grid 

    Methods 
    ======= 
    update: 
        updates selected valuation parameters 
    delta: 
        returns the delta of the derivative 
    vega: 
        returns the vega of the derivative 
    ''' 

    def __init__(self, name, underlying, mar_env, payoff_func=''): 
        self.name = name 
        self.pricing_date = mar_env.pricing_date 
        try: 
            # strike is optional 
            self.strike = mar_env.get_constant('strike') 
        except: 
            pass 
        self.maturity = mar_env.get_constant('maturity') 
        self.currency = mar_env.get_constant('currency') 
        # simulation parameters and discount curve from simulation object 
        self.frequency = underlying.frequency 
        self.paths = underlying.paths 
        self.discount_curve = underlying.discount_curve 
        self.payoff_func = payoff_func 
        self.underlying = underlying 
        # provide pricing_date and maturity to underlying 
        self.underlying.special_dates.extend([self.pricing_date, 
                                              self.maturity]) 

    def update(self, initial_value=None, volatility=None, 
               strike=None, maturity=None): 
        if initial_value is not None: 
            self.underlying.update(initial_value=initial_value) 
        if volatility is not None: 
            self.underlying.update(volatility=volatility) 
        if strike is not None: 
            self.strike = strike 
        if maturity is not None: 
            self.maturity = maturity 
            # add new maturity date if not in time_grid 
            if maturity not in self.underlying.time_grid: 
                self.underlying.special_dates.append(maturity) 
                self.underlying.instrument_values = None 

    def delta(self, interval=None, accuracy=4): 
        if interval is None: 
            interval = self.underlying.initial_value / 50. 
        # forward-difference approximation 
        # calculate left value for numerical delta 
        value_left = self.present_value(fixed_seed=True) 
        # numerical underlying value for right value 
        initial_del = self.underlying.initial_value + interval 
        self.underlying.update(initial_value=initial_del) 
        # calculate right value for numerical delta 
        value_right = self.present_value(fixed_seed=True) 
        # reset the initial_value of the simulation object 
        self.underlying.update(initial_value=initial_del - interval) 
        delta = (value_right - value_left) / interval 
        # correct for potential numerical errors 
        if delta < -1.0: 
            return -1.0 
        elif delta > 1.0: 
            return 1.0 
        else: 
            return round(delta, accuracy) 

    def vega(self, interval=0.01, accuracy=4): 
        if interval < self.underlying.volatility / 50.: 
            interval = self.underlying.volatility / 50. 
        # forward-difference approximation 
        # calculate the left value for numerical vega 
        value_left = self.present_value(fixed_seed=True) 
        # numerical volatility value for right value 
        vola_del = self.underlying.volatility + interval 
        # update the simulation object 
        self.underlying.update(volatility=vola_del) 
        # calculate the right value for numerical vega 
        value_right = self.present_value(fixed_seed=True) 
        # reset volatility value of simulation object 
        self.underlying.update(volatility=vola_del - interval) 
        vega = (value_right - value_left) / interval 
        return round(vega, accuracy) 


One topic covered by the generic dx.valuation_class class is the estimation of 
Greeks. This is worth taking a closer look at. To this end, assume that a continuously 
differentiable function $V(S_0, \sigma_0)$ is available that represents the present value 
of an option. The delta of the option is then defined as the first partial derivative with 
respect to the current value of the underlying $S_0$; i.e., 
$\Delta = \frac{\partial V(\cdot)}{\partial S_0}$. 


Suppose now that from Monte Carlo valuation (see Chapter 12 and subsequent 
sections in this chapter) there is a numerical Monte Carlo estimator 
$\bar{V}(S_0, \sigma_0)$ available for the option value. A numerical approximation for 
the delta of the option is then given in Equation 19-1.¹ This is what the delta() method 
of the generic valuation class implements. The method assumes the existence of a 
present_value() method that returns the Monte Carlo estimator given a certain set of 
parameter values. 


Equation 19-1. Numerical delta of an option 

$$\bar{\Delta} = \frac{\bar{V}(S_0 + \Delta S, \sigma_0) - \bar{V}(S_0, \sigma_0)}{\Delta S}, \quad \Delta S > 0$$ 


Similarly, the vega of the instrument is defined as the first partial derivative of the 
present value with respect to the current (instantaneous) volatility $\sigma_0$, i.e., 
$\mathbf{V} = \frac{\partial V(\cdot)}{\partial \sigma_0}$. Again assuming the existence of a 
Monte Carlo estimator for the value of the option, Equation 19-2 provides a numerical 
approximation for the vega. This is what the vega() method of the dx.valuation_class 
class implements. 


Equation 19-2. Numerical vega of an option 

$$\bar{\mathbf{V}} = \frac{\bar{V}(S_0, \sigma_0 + \Delta \sigma) - \bar{V}(S_0, \sigma_0)}{\Delta \sigma}, \quad \Delta \sigma > 0$$ 


Note that the discussion of delta and vega is based only on the existence of either a 
differentiable function or a Monte Carlo estimator for the present value of an option. 
This is the very reason why one can define methods to numerically estimate these 
quantities without knowledge of the exact definition and numerical implementation 
of the Monte Carlo estimator. 
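The point that the Greeks only require some present value estimator can be illustrated with a short sketch. It assumes a simple one-step Monte Carlo estimator for a European call under geometric Brownian motion; the names mc_call_value and forward_difference and all parameter values are assumptions made for illustration. The fixed seed plays the role of fixed_seed=True in the class, so that both valuations use the same random numbers.

```python
import numpy as np


def mc_call_value(s0, sigma, strike=40.0, r=0.06, T=1.0,
                  n_paths=100000, seed=1000):
    """One-step Monte Carlo estimator for a European call (sketch)."""
    rng = np.random.default_rng(seed)  # fixed seed: same random numbers
    z = rng.standard_normal(n_paths)
    s_T = s0 * np.exp((r - 0.5 * sigma ** 2) * T
                      + sigma * np.sqrt(T) * z)
    return np.exp(-r * T) * np.mean(np.maximum(s_T - strike, 0))


def forward_difference(func, x, interval):
    """Forward-difference approximation of the first derivative."""
    return (func(x + interval) - func(x)) / interval


s0, sigma = 36.0, 0.2
# numerical delta: bump the initial value, keep the volatility fixed
delta_est = forward_difference(lambda s: mc_call_value(s, sigma),
                               s0, s0 / 50.)
# numerical vega: bump the volatility, keep the initial value fixed
vega_est = forward_difference(lambda v: mc_call_value(s0, v),
                              sigma, 0.01)
```

The helper forward_difference() knows nothing about the model; it only needs a callable that returns a present value, which is the design idea behind placing delta() and vega() in the generic class.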


1 For details on how to estimate Greeks numerically by Monte Carlo simulation, refer to Chapter 7 of 
Glasserman (2004). The code uses forward-difference schemes only since this leads to only one additional 
simulation and revaluation of the option. For example, a central-difference approximation would lead to two 
option revaluations and therefore a higher computational burden. 


European Exercise 


The first case to which the generic valuation class is specialized is the case of 
European exercise. To this end, consider the following simplified recipe to generate a 
Monte Carlo estimator for an option value: 


1. Simulate the relevant underlying risk factor $S$ under the risk-neutral measure $I$ 
   times to come up with as many simulated values of the underlying at the 
   maturity of the option $T$, i.e., $S_T(i)$, $i \in \{1, 2, \ldots, I\}$. 

2. Calculate the payoff $h_T$ of the option at maturity for every simulated value of the 
   underlying, i.e., $h_T(S_T(i))$, $i \in \{1, 2, \ldots, I\}$. 

3. Derive the Monte Carlo estimator for the option's present value as 
   $\bar{V} \equiv e^{-rT} \frac{1}{I} \sum_{i=1}^{I} h_T(S_T(i))$. 
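The three steps of the recipe can be sketched in a few lines of standalone NumPy code. The sketch assumes geometric Brownian motion with a single time step to maturity and a constant short rate; the function names mc_european_value and bs_call_value are hypothetical, and the Black-Scholes-Merton closed-form value is included only as a sanity check for the call option case.

```python
import math

import numpy as np


def mc_european_value(payoff, s0, r, sigma, T, n_paths, seed=None):
    """Monte Carlo estimator for a European payoff (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # step 1: simulate I values of the underlying at maturity
    z = rng.standard_normal(n_paths)
    s_T = s0 * np.exp((r - 0.5 * sigma ** 2) * T
                      + sigma * math.sqrt(T) * z)
    # step 2: payoff at maturity for every simulated value
    h_T = payoff(s_T)
    # step 3: discounted average of the payoffs
    return math.exp(-r * T) * np.sum(h_T) / len(h_T)


def bs_call_value(s0, strike, r, sigma, T):
    """Black-Scholes-Merton value of a European call (benchmark only)."""
    def norm_cdf(x):
        return 0.5 * (1 + math.erf(x / math.sqrt(2)))
    d1 = ((math.log(s0 / strike) + (r + 0.5 * sigma ** 2) * T)
          / (sigma * math.sqrt(T)))
    d2 = d1 - sigma * math.sqrt(T)
    return s0 * norm_cdf(d1) - strike * math.exp(-r * T) * norm_cdf(d2)


strike = 40.0
mc_value = mc_european_value(lambda s: np.maximum(s - strike, 0),
                             s0=36.0, r=0.06, sigma=0.2, T=1.0,
                             n_paths=200000, seed=1000)
bs_value = bs_call_value(36.0, strike, 0.06, 0.2, 1.0)
```

Passing the payoff as a callable keeps the estimator independent of the specific derivative; the valuation class achieves the same flexibility with a payoff string.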


The Valuation Class 


The following code shows the class implementing the present_value() method 
based on this recipe. In addition, it contains the method generate_payoff() to 
generate the simulated paths and the payoff of the option given the simulated paths. 
This, of course, builds the very basis for the Monte Carlo estimator: 


# 
# DX Package 
# 
# Valuation -- European Exercise Class 
# 
# valuation_mcs_european.py 
# 
# Python for Finance, 2nd ed. 
# (c) Dr. Yves J. Hilpisch 
# 
import numpy as np 

from valuation_class import valuation_class 


class valuation_mcs_european(valuation_class): 
    ''' Class to value European options with arbitrary payoff 
    by single-factor Monte Carlo simulation. 

    Methods 
    ======= 
    generate_payoff: 
        returns payoffs given the paths and the payoff function 
    present_value: 
        returns present value (Monte Carlo estimator) 
    ''' 

    def generate_payoff(self, fixed_seed=False): 
        ''' 
        Parameters 
        ========== 
        fixed_seed: bool 
            use same/fixed seed for valuation 
        ''' 
        try: 
            # strike is optional 
            strike = self.strike 
        except: 
            pass 
        paths = self.underlying.get_instrument_values(fixed_seed=fixed_seed) 
        time_grid = self.underlying.time_grid 
        try: 
            time_index = np.where(time_grid == self.maturity)[0] 
            time_index = int(time_index) 
        except: 
            print('Maturity date not in time grid of underlying.') 
        maturity_value = paths[time_index] 
        # average value over whole path 
        mean_value = np.mean(paths[:time_index], axis=1) 
        # maximum value over whole path 
        max_value = np.amax(paths[:time_index], axis=1)[-1] 
        # minimum value over whole path 
        min_value = np.amin(paths[:time_index], axis=1)[-1] 
        try: 
            payoff = eval(self.payoff_func) 
            return payoff 
        except: 
            print('Error evaluating payoff function.') 

    def present_value(self, accuracy=6, fixed_seed=False, full=False): 
        ''' 
        Parameters 
        ========== 
        accuracy: int 
            number of decimals in returned result 
        fixed_seed: bool 
            use same/fixed seed for valuation 
        full: bool 
            return also full 1d array of present values 
        ''' 
        cash_flow = self.generate_payoff(fixed_seed=fixed_seed) 
        discount_factor = self.discount_curve.get_discount_factors( 
            (self.pricing_date, self.maturity))[0, 1] 
        result = discount_factor * np.sum(cash_flow) / len(cash_flow) 
        if full: 
            return round(result, accuracy), discount_factor * cash_flow 
        else: 
            return round(result, accuracy) 


The generate_payoff() method provides some special objects to be used for the 
definition of the payoff of the option: 


• strike is the strike of the option. 

• maturity_value represents the 1D ndarray object with the simulated values of 
  the underlying at maturity of the option. 

• mean_value is the average of the underlying over a whole path from today until 
  maturity. 

• max_value is the maximum value of the underlying over a whole path. 

• min_value gives the minimum value of the underlying over a whole path. 


The last three allow for the efficient handling of options with Asian (i.e., lookback or 
path-dependent) features. 
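How a payoff string gets access to these objects can be sketched with a small standalone helper. The function evaluate_payoff is hypothetical and not part of the DX package; it assumes a paths array with dates along the first axis and computes the path statistics per path (along axis 0), which is a choice of this sketch rather than a transcription of the listing above.

```python
import numpy as np


def evaluate_payoff(payoff_func, paths, strike=None):
    """Evaluate a payoff string against simulated paths (sketch).

    paths: ndarray of shape (dates, paths); the last row is maturity
    """
    names = {
        'strike': strike,
        'maturity_value': paths[-1],          # values at maturity
        'mean_value': np.mean(paths, axis=0), # average per path
        'max_value': np.amax(paths, axis=0),  # maximum per path
        'min_value': np.amin(paths, axis=0),  # minimum per path
    }
    # the payoff string only sees NumPy and the names defined above
    return eval(payoff_func, {'np': np}, names)


# two hand-made paths over three dates (hypothetical values)
demo_paths = np.array([[36., 36.],
                       [38., 33.],
                       [42., 30.]])
call_payoff = evaluate_payoff('np.maximum(maturity_value - strike, 0)',
                              demo_paths, strike=40.)
asian_payoff = evaluate_payoff('np.maximum(mean_value - strike, 0)',
                               demo_paths, strike=35.)
lookback_payoff = evaluate_payoff('max_value - maturity_value', demo_paths)
```

Since eval() executes arbitrary Python code, an approach like this is only appropriate when the payoff strings come from a trusted source.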


Flexible Payoffs 


The approach taken for the valuation of options and derivatives 
with European exercise is quite flexible in that arbitrary payoff 
functions can be defined. This allows, among other things, 
modeling of derivatives with conditional exercise (e.g., options) as well as 
unconditional exercise (e.g., forwards). It also allows the inclusion 
of exotic payoff elements, such as lookback features. 


A Use Case 


The application of the valuation class dx.valuation_mcs_european is best illustrated 
by a specific use case. However, before a valuation class can be instantiated, an 
instance of a simulation object (i.e., an underlying for the option to be valued) is 
needed. From Chapter 18, the dx.geometric_brownian_motion class is used to 
model the underlying: 


In [64]: me_gbm = market_environment('me_gbm', dt.datetime(2020, 1, 1)) 


In [65]: me_gbm.add_constant('initial_value', 36.) 
me_gbm.add_constant('volatility', 0.2) 
me_gbm.add_constant('final_date', dt.datetime(2020, 12, 31)) 
me_gbm.add_constant('currency', 'EUR') 
me_gbm.add_constant('frequency', 'M') 
me_gbm.add_constant('paths', 10000) 


In [66]: csr = constant_short_rate('csr', 0.06) 
In [67]: me_gbm.add_curve('discount_curve', csr) 


In [68]: gbm = geometric_brownian_motion('gbm', me_gbm) 


In addition to a simulation object, one needs to define a market environment for the 
option itself. It has to contain at least a maturity and a currency. Optionally, a value 
for the strike parameter can be included as well: 


In [69]: me_call = market_environment('me_call', me_gbm.pricing_date) 


In [70]: me_call.add_constant('strike', 40.) 
         me_call.add_constant('maturity', dt.datetime(2020, 12, 31)) 
         me_call.add_constant('currency', 'EUR') 


A central element, of course, is the payoff function, provided here as a str object 
containing Python code that the eval() function can evaluate. A European call 
option shall be modeled. Such an option has a payoff of $h_T = \max(S_T - K, 0)$, with 
$S_T$ being the value of the underlying at maturity and $K$ being the strike price of the 
option. In Python and NumPy (with vectorized storage of all simulated values) this 
takes on the following form: 


In [71]: payoff_func = 'np.maximum(maturity_value - strike, 0)' 


Having all the ingredients together, one can then instantiate an object from the 
dx.valuation_mcs_european class. With the valuation object available, all quantities 
of interest are only one method call away: 


In [72]: from valuation_mcs_european import valuation_mcs_european 


In [73]: eur_call = valuation_mcs_european('eur_call', underlying=gbm, 
                         mar_env=me_call, payoff_func=payoff_func) 


In [74]: %time eur_call.present_value()  # (1) 
         CPU times: user 14.8 ms, sys: 4.06 ms, total: 18.9 ms 
         Wall time: 43.5 ms 

Out[74]: 2.146828 

In [75]: %time eur_call.delta()  # (2) 
         CPU times: user 12.4 ms, sys: 2.68 ms, total: 15.1 ms 
         Wall time: 40.1 ms 

Out[75]: 0.5155 

In [76]: %time eur_call.vega()  # (3) 
         CPU times: user 21 ms, sys: 2.72 ms, total: 23.7 ms 
         Wall time: 89.9 ms 

Out[76]: 14.301 


(1) Estimates the present value of the European call option. 


(2) Estimates the delta of the option numerically; the delta is positive for calls. 


(3) Estimates the vega of the option numerically; the vega is positive for both calls 
and puts. 
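As a rough plausibility check, the Monte Carlo estimates can be set against the Black-Scholes-Merton closed-form values for a European call. The following is a sketch under stated assumptions: the function name bsm_call is hypothetical, and a time to maturity of exactly one year is assumed for the period from the pricing date to the maturity date. The closed-form numbers need not coincide with the simulation-based estimates, which depend on the number of paths, the time discretization, and the bump size used for the numerical Greeks:

```python
from math import erf, exp, log, pi, sqrt


def norm_cdf(x):
    # standard normal cumulative distribution function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def norm_pdf(x):
    # standard normal probability density function
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)


def bsm_call(s0, strike, T, r, sigma):
    ''' Closed-form value, delta, and vega of a European call
    in the Black-Scholes-Merton model (no dividends). '''
    d1 = (log(s0 / strike) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    value = s0 * norm_cdf(d1) - strike * exp(-r * T) * norm_cdf(d2)
    delta = norm_cdf(d1)
    vega = s0 * norm_pdf(d1) * sqrt(T)
    return value, delta, vega


# parameters of the use case; T = 1.0 is an assumption
value, delta, vega = bsm_call(s0=36., strike=40., T=1.0, r=0.06, sigma=0.2)
print(round(value, 4), round(delta, 4), round(vega, 4))
```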


Once the valuation object is instantiated, a more comprehensive analysis of the 
present value and the Greeks is easily implemented. The following code calculates the 
present value, delta, and vega for initial values of the underlying ranging from 34 to 
46 EUR. The results are presented graphically in Figure 19-1: 


In [77]: %%time 
         s_list = np.arange(34., 46.1, 2.) 
         p_list = []; d_list = []; v_list = [] 
         for s in s_list: 
             eur_call.update(initial_value=s) 
             p_list.append(eur_call.present_value(fixed_seed=True)) 
             d_list.append(eur_call.delta()) 
             v_list.append(eur_call.vega()) 
         CPU times: user 374 ms, sys: 8.82 ms, total: 383 ms 
         Wall time: 609 ms 


In [78]: from plot_option_stats import plot_option_stats 


In [79]: plot_option_stats(s_list, p_list, d_list, v_list) 


[Figure: three stacked panels showing present value, Delta, and Vega plotted 
against the initial value of the underlying] 


Figure 19-1. Present value, delta, and vega estimates for European call option 


The visualization makes use of the helper function plot_option_stats(): 


#
# DX Package
#
# Valuation -- Plotting Options Statistics
#
# plot_option_stats.py
#
# Python for Finance, 2nd ed.
# (c) Dr. Yves J. Hilpisch
#
import matplotlib.pyplot as plt


def plot_option_stats(s_list, p_list, d_list, v_list):
    ''' Plots option prices, deltas, and vegas for a set of
    different initial values of the underlying.

    Parameters
    ==========
    s_list: array or list
        set of initial values of the underlying
    p_list: array or list
        present values
    d_list: array or list
        results for deltas
    v_list: array or list
        results for vegas
    '''
    plt.figure(figsize=(10, 7))
    # upper panel: present values
    sub1 = plt.subplot(311)
    plt.plot(s_list, p_list, 'ro', label='present value')
    plt.plot(s_list, p_list, 'b')
    plt.legend(loc=0)
    plt.setp(sub1.get_xticklabels(), visible=False)
    # middle panel: deltas
    sub2 = plt.subplot(312)
    plt.plot(s_list, d_list, 'go', label='Delta')
    plt.plot(s_list, d_list, 'b')
    plt.legend(loc=0)
    plt.ylim(min(d_list) - 0.1, max(d_list) + 0.1)
    plt.setp(sub2.get_xticklabels(), visible=False)
    # lower panel: vegas
    sub3 = plt.subplot(313)
    plt.plot(s_list, v_list, 'yo', label='Vega')
    plt.plot(s_list, v_list, 'b')
    plt.xlabel('initial value of underlying')
    plt.legend(loc=0)


This illustrates that working with the DX package—despite the fact that heavy numer- 
ics are involved—boils down to an approach that is comparable to having a closed- 
form option pricing formula available. However, this approach does not only apply to 
such simple or “plain vanilla” payoffs as the one considered so far. With exactly the 
same approach, one can handle more complex payoffs. 


To this end, consider the following payoff, a mixture of a regular and an Asian payoff. 
The handling and the analysis are the same and are mainly independent of the type of 
payoff defined. Figure 19-2 shows that delta becomes 1 when the initial value of the 
underlying reaches the strike price of 40 in this case. Every (marginal) increase of the 
initial value of the underlying leads to the same (marginal) increase in the option’s 
value from this particular point on: 


In [80]: payoff_func = 'np.maximum(0.33 * ' 
         payoff_func += '(maturity_value + max_value) - 40, 0)'  # (1) 


In [81]: eur_as_call = valuation_mcs_european('eur_as_call', underlying=gbm, 
                         mar_env=me_call, payoff_func=payoff_func) 


In [82]: %%time 
         s_list = np.arange(34., 46.1, 2.) 
         p_list = []; d_list = []; v_list = [] 
         for s in s_list: 
             eur_as_call.update(s) 
             p_list.append(eur_as_call.present_value(fixed_seed=True)) 
             d_list.append(eur_as_call.delta()) 
             v_list.append(eur_as_call.vega()) 
         CPU times: user 319 ms, sys: 14.2 ms, total: 333 ms 
         Wall time: 488 ms 


In [83]: plot_option_stats(s_list, p_list, d_list, v_list) 


(1) Payoff dependent on both the simulated maturity value and the maximum value 
over the simulated path. 


[Figure: stacked panels of present value, Delta, and Vega plotted against the 
initial value of the underlying] 


Figure 19-2. Present value, delta, and vega estimates for option with Asian feature 


American Exercise 


The valuation of options with American exercise or Bermudan exercise is much more 
involved than with European exercise.² Therefore, a bit more valuation theory is 
needed before proceeding to the valuation class. 


2 American exercise refers to a situation where exercise is possible at every instant of time over a fixed time 
interval (at least during trading hours). Bermudan exercise generally refers to a situation where there are mul- 
tiple discrete exercise dates. In numerical applications, American exercise is approximated by Bermudan 
exercise, and maybe letting the number of exercise dates go to infinity in the limit. 


Least-Squares Monte Carlo 


Although Cox, Ross, and Rubinstein (1979) presented with their binomial model a 
simple numerical method to value European and American options in the same 
framework, only with the Longstaff-Schwartz (2001) approach was the valuation of 
American options by Monte Carlo simulation (MCS) satisfactorily solved. The major 
problem is that MCS per se is a forward-moving algorithm, while the valuation of 
American options is generally accomplished by backward induction, estimating the 
continuation value of the American option starting at maturity and working back to 
the present. 
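Backward induction itself is easiest to see in the binomial setting mentioned above. The following sketch values an American put on a Cox-Ross-Rubinstein tree; the function name crr_american_put, the number of steps, and the parameter values (taken from the first option of the later use case, with one year to maturity assumed) are illustrative choices and not part of the DX package:

```python
import numpy as np


def crr_american_put(s0=36.0, strike=40.0, r=0.06, sigma=0.2,
                     T=1.0, steps=500):
    ''' American put value on a Cox-Ross-Rubinstein binomial tree. '''
    dt = T / steps
    u = np.exp(sigma * np.sqrt(dt))      # up factor
    d = 1.0 / u                          # down factor
    q = (np.exp(r * dt) - d) / (u - d)   # risk-neutral up probability
    df = np.exp(-r * dt)                 # one-period discount factor
    # values at maturity; index j counts the number of down moves
    j = np.arange(steps + 1)
    V = np.maximum(strike - s0 * u ** (steps - j) * d ** j, 0.0)
    # work backward from maturity to the present
    for n in range(steps - 1, -1, -1):
        j = np.arange(n + 1)
        S = s0 * u ** (n - j) * d ** j
        C = df * (q * V[:-1] + (1.0 - q) * V[1:])  # continuation value
        V = np.maximum(strike - S, C)              # exercise if worth more
    return float(V[0])


print(round(crr_american_put(), 3))
```

At every node the continuation value is available exactly because all successor values are known. A forward-moving simulation lacks this information, which is the gap the regression step of the Longstaff-Schwartz approach fills.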


The major insight of the Longstaff-Schwartz (2001) model is to use an ordinary 
least-squares regression to estimate the continuation value based on the cross section 
of all available simulated values.³ The algorithm takes into account, per path: 


• The simulated value of the underlying(s) 

• The inner value of the option 

• The actual continuation value given the specific path 


In discrete time, the value of a Bermudan option (and in the limit of an American 
option) is given by the optimal stopping problem, as presented in Equation 19-3 for a 
finite set of points in time $0 < t_1 < t_2 < \ldots < T$.⁴ 


Equation 19-3. Optimal stopping problem in discrete time for Bermudan option 

$$V_0 = \sup_{\tau \in \{0, t_1, t_2, \ldots, T\}} e^{-r\tau} \mathbf{E}_0^Q\left(h_\tau(S_\tau)\right)$$ 


Equation 19-4 presents the continuation value of the American option at date 
$0 \le t_m < T$. It is the risk-neutral expectation at date $t_m$ under the martingale 
measure of the value of the American option $V_{t_{m+1}}$ at the subsequent date. 


Equation 19-4. Continuation value for the American option 

$$C_{t_m}(s) = e^{-r(t_{m+1} - t_m)} \mathbf{E}_{t_m}^Q\left(V_{t_{m+1}}(S_{t_{m+1}}) \mid S_{t_m} = s\right)$$ 


3 That is why their algorithm is generally abbreviated as LSM, for Least-Squares Monte Carlo. 


4 Kohler (2010) provides a concise overview of the theory of American option valuation in general and the use 
of regression-based methods in particular. 


The value of the American option $V_{t_m}$ at date $t_m$ can be shown to equal the formula in 
Equation 19-5, i.e., the maximum of the payoff of immediate exercise (inner value) 
and the expected payoff of not exercising (continuation value). 


Equation 19-5. Value of American option at any given date 

$$V_{t_m} = \max\left(h_{t_m}(s), C_{t_m}(s)\right)$$ 


In Equation 19-5, the inner value is of course easily calculated. The continuation 
value is what makes it a bit trickier. The Longstaff-Schwartz (2001) algorithm 
approximates this value by a regression, as presented in Equation 19-6. There, $i$ 
stands for the current simulated path, $D$ is the number of basis functions for the 
regression used, $\alpha^*$ are the optimal regression parameters, and $b_d$ is the regression 
function with number $d$. 


Equation 19-6. Regression-based approximation of continuation value 

$$\bar{C}_{t_m,i} = \sum_{d=1}^{D} \alpha_{d,t_m}^* \cdot b_d(S_{t_m,i})$$ 


The optimal regression parameters are the result of the solution of the least-squares 
regression problem presented in Equation 19-7. Here, 
$Y_{t_m,i} \equiv e^{-r(t_{m+1} - t_m)} V_{t_{m+1},i}$ is the actual continuation value at date $t_m$ 
for path $i$ (and not a regressed/estimated one), and $I$ is the number of simulated paths. 


Equation 19-7. Ordinary least-squares regression 

$$\min_{\alpha_{1,t_m}, \ldots, \alpha_{D,t_m}} \frac{1}{I} \sum_{i=1}^{I} \left(Y_{t_m,i} - \sum_{d=1}^{D} \alpha_{d,t_m} \cdot b_d(S_{t_m,i})\right)^2$$ 

This completes the basic (mathematical) tool set to value an American option by 
MCS. 
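The tool set can be condensed into a short, self-contained sketch of the LSM algorithm for an American put. It is not the DX implementation: the function name lsm_american_put, the geometric Brownian motion simulation, the equidistant time grid, and all default values are assumptions for illustration. A polynomial fitted with np.polyfit() plays the role of the basis functions, and the regression uses all paths:

```python
import numpy as np


def lsm_american_put(s0=36.0, strike=40.0, r=0.06, sigma=0.2, T=1.0,
                     steps=50, paths=20000, degree=5, seed=1000):
    ''' LSM estimator for an American put under geometric Brownian motion. '''
    rng = np.random.default_rng(seed)
    dt = T / steps
    df = np.exp(-r * dt)  # discount factor for one time interval
    # forward step: simulate all paths from today until maturity
    z = rng.standard_normal((steps, paths))
    increments = (r - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * z
    S = np.empty((steps + 1, paths))
    S[0] = s0
    S[1:] = s0 * np.exp(np.cumsum(increments, axis=0))
    h = np.maximum(strike - S, 0.0)  # inner values for all dates and paths
    # backward step: start with the payoff at maturity
    V = h[-1].copy()
    for t in range(steps - 1, 0, -1):
        Y = V * df                           # actual continuation values
        coeffs = np.polyfit(S[t], Y, degree)  # OLS regression
        C = np.polyval(coeffs, S[t])         # estimated continuation values
        # exercise decision is based on C; the value carried backward is
        # either the inner value or the actual continuation value Y
        V = np.where(h[t] > C, h[t], Y)
    return df * float(np.mean(V))


print(round(lsm_american_put(), 3))
```

The regressed values C are only used to decide whether to exercise. Carrying the estimated values backward instead of the actual ones would feed the regression error into every earlier time step.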


The Valuation Class 


The code that follows represents the class for the valuation of options and derivatives 
with American exercise. There is one noteworthy step in the implementation of the 
LSM algorithm in the present_value() method (which is also commented on 
inline): the optimal decision step. Here, it is important that, based on the decision that 
is made, the LSM algorithm takes either the inner value or the actual continuation 
value, and not the estimated continuation value:⁵ 


#
# DX Package
#
# Valuation -- American Exercise Class
#
# valuation_mcs_american.py
#
# Python for Finance, 2nd ed.
# (c) Dr. Yves J. Hilpisch
#
import numpy as np

from valuation_class import valuation_class


class valuation_mcs_american(valuation_class):
    ''' Class to value American options with arbitrary payoff
    by single-factor Monte Carlo simulation.

    Methods
    =======
    generate_payoff:
        returns payoffs given the paths and the payoff function
    present_value:
        returns present value (LSM Monte Carlo estimator)
        according to Longstaff-Schwartz (2001)
    '''

    def generate_payoff(self, fixed_seed=False):
        '''
        Parameters
        ==========
        fixed_seed:
            use same/fixed seed for valuation
        '''
        try:
            # strike is optional; the name is used by eval() below
            strike = self.strike
        except AttributeError:
            pass
        paths = self.underlying.get_instrument_values(fixed_seed=fixed_seed)
        time_grid = self.underlying.time_grid
        time_index_start = int(np.where(time_grid == self.pricing_date)[0])
        time_index_end = int(np.where(time_grid == self.maturity)[0])
        instrument_values = paths[time_index_start:time_index_end + 1]
        payoff = eval(self.payoff_func)
        return instrument_values, payoff, time_index_start, time_index_end

    def present_value(self, accuracy=6, fixed_seed=False, bf=5, full=False):
        '''
        Parameters
        ==========
        accuracy: int
            number of decimals in returned result
        fixed_seed: bool
            use same/fixed seed for valuation
        bf: int
            number of basis functions for regression
        full: bool
            return also full 1d array of present values
        '''
        instrument_values, inner_values, time_index_start, time_index_end = \
            self.generate_payoff(fixed_seed=fixed_seed)
        time_list = self.underlying.time_grid[
            time_index_start:time_index_end + 1]
        discount_factors = self.discount_curve.get_discount_factors(
            time_list, dtobjects=True)
        V = inner_values[-1]
        for t in range(len(time_list) - 2, 0, -1):
            # derive relevant discount factor for given time interval
            df = discount_factors[t, 1] / discount_factors[t + 1, 1]
            # regression step
            rg = np.polyfit(instrument_values[t], V * df, bf)
            # calculation of continuation values per path
            C = np.polyval(rg, instrument_values[t])
            # optimal decision step:
            # if condition is satisfied (inner value > regressed cont. value)
            # then take inner value; take actual cont. value otherwise
            V = np.where(inner_values[t] > C, inner_values[t], V * df)
        df = discount_factors[0, 1] / discount_factors[1, 1]
        result = df * np.sum(V) / len(V)
        if full:
            return round(result, accuracy), df * V
        else:
            return round(result, accuracy)


5 See also Chapter 6 of Hilpisch (2015). 


A Use Case 


As before, a use case illustrates how to work with the dx.valuation_mcs_american 
class. The use case replicates all American option values as presented in Table 1 of 
the seminal paper by Longstaff and Schwartz (2001). The underlying is the same as 
before, a dx.geometric_brownian_motion object. The initial parameterization is as 
follows: 


In [84]: me_gbm = market_environment('me_gbm', dt.datetime(2020, 1, 1)) 


In [85]: me_gbm.add_constant('initial_value', 36.) 
         me_gbm.add_constant('volatility', 0.2) 
         me_gbm.add_constant('final_date', dt.datetime(2021, 12, 31)) 
         me_gbm.add_constant('currency', 'EUR') 
         me_gbm.add_constant('frequency', 'W') 
         me_gbm.add_constant('paths', 50000) 


In [86]: csr = constant_short_rate('csr', 0.06) 

In [87]: me_gbm.add_curve('discount_curve', csr) 

In [88]: gbm = geometric_brownian_motion('gbm', me_gbm) 

In [89]: payoff_func = 'np.maximum(strike - instrument_values, 0)' 

In [90]: me_am_put = market_environment('me_am_put', dt.datetime(2020, 1, 1)) 


In [91]: me_am_put.add_constant('maturity', dt.datetime(2020, 12, 31)) 
         me_am_put.add_constant('strike', 40.) 
         me_am_put.add_constant('currency', 'EUR') 


The next step is to instantiate the valuation object based on the numerical assump- 
tions and to initiate the valuations. The valuation of the American put option can 
take quite a bit longer than the same task for the European options. Not only is the 
number of paths and time intervals increased, but the algorithm is also more compu- 
tationally demanding due to the backward induction and the regression per induc- 
tion step. The numerical estimate obtained for the first option considered is close to 
the correct one reported in the original paper of 4.478: 


In [92]: from valuation_mcs_american import valuation_mcs_american 


In [93]: am_put = valuation_mcs_american('am_put', underlying=gbm, 
                         mar_env=me_am_put, payoff_func=payoff_func) 


In [94]: %time am_put.present_value(fixed_seed=True, bf=5) 
         CPU times: user 1.57 s, sys: 219 ms, total: 1.79 s 
         Wall time: 2.01 s 

Out[94]: 4.472834 


Due to the very construction of the LSM Monte Carlo estimator, it represents a lower 
bound of the mathematically correct American option value.⁶ Therefore, one expects 
the numerical estimate to lie under the true value in any numerically realistic case. 
Alternative dual estimators can provide upper bounds as well.⁷ Taken together, two 
such different estimators then define an interval for the true American option value. 


6 The main reason is that the “optimal” exercise policy based on the regression estimates for the continuation 
values is in fact “suboptimal.” 


7 See Chapter 6 in Hilpisch (2015) for a dual algorithm leading to an upper bound and a Python 
implementation thereof. 


The main stated goal of this use case is to replicate all American option values of 
Table 1 in the original paper. To this end, one only needs to combine the valuation 
object with a nested loop. During the innermost loop, the valuation object has to be 
updated according to the then-current parameterization: 


In [95]: %%time 
         ls_table = [] 
         for initial_value in (36., 38., 40., 42., 44.): 
             for volatility in (0.2, 0.4): 
                 for maturity in (dt.datetime(2020, 12, 31), 
                                  dt.datetime(2021, 12, 31)): 
                     am_put.update(initial_value=initial_value, 
                                   volatility=volatility, 
                                   maturity=maturity) 
                     ls_table.append([initial_value, 
                                      volatility, 
                                      maturity, 
                                      am_put.present_value(bf=5)]) 
         CPU times: user 41.1 s, sys: 2.46 s, total: 43.5 s 
         Wall time: 1min 30s 


In [96]: print('S0 | Vola | T | Value') 
         print(22 * '-') 
         for r in ls_table: 
             print('%d | %3.1f | %d | %5.3f' % 
                   (r[0], r[1], r[2].year - 2019, r[3])) 
         S0 | Vola | T | Value 
         ---------------------- 
         36 | 0.2 | 1 | 4.447 
         36 | 0.2 | 2 | 4.773 
         36 | 0.4 | 1 | 7.006 
         36 | 0.4 | 2 | 8.377 
         38 | 0.2 | 1 | 3.213 
         38 | 0.2 | 2 | 3.645 
         38 | 0.4 | 1 | 6.069 
         38 | 0.4 | 2 | 7.539 
         40 | 0.2 | 1 | 2.269 
         40 | 0.2 | 2 | 2.781 
         40 | 0.4 | 1 | 5.211 
         40 | 0.4 | 2 | 6.756 
         42 | 0.2 | 1 | 1.556 
         42 | 0.2 | 2 | 2.102 
         42 | 0.4 | 1 | 4.466 
         42 | 0.4 | 2 | 6.049 
         44 | 0.2 | 1 | 1.059 
         44 | 0.2 | 2 | 1.617 
         44 | 0.4 | 1 | 3.852 
         44 | 0.4 | 2 | 5.490 


These results are a simplified version of Table 1 in the paper by Longstaff and 
Schwartz (2001). Overall, the numerical values come close to those reported in the 
paper, where some different parameters have been used (they use, for example, 
double the number of paths). 


To conclude the use case, note that the estimation of Greeks for American options is 
formally the same as for European options—a major advantage of the implemented 
approach over alternative numerical methods (like the binomial model): 

In [97]: am_put.update(initial_value=36.) 
         am_put.delta() 
Out[97]: -0.4631 


In [98]: am_put.vega() 
Out[98]: 18.0961 
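The reason the Greeks carry over unchanged is that a difference quotient treats the pricer as a black box. The sketch below illustrates this bump-and-revalue idea with a generic helper; the names forward_difference and bsm_call_value are hypothetical, the bump sizes are arbitrary choices, and a closed-form European call value merely stands in for a simulation-based pricer (for which the same random numbers should be reused in both valuations):

```python
from math import erf, exp, log, sqrt


def bsm_call_value(s0, sigma, strike=40., T=1.0, r=0.06):
    ''' Stand-in pricer: Black-Scholes-Merton European call value. '''
    d1 = (log(s0 / strike) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    cdf = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))
    return s0 * cdf(d1) - strike * exp(-r * T) * cdf(d2)


def forward_difference(pricer, argument, interval):
    ''' Forward difference quotient of a one-argument pricer. '''
    return (pricer(argument + interval) - pricer(argument)) / interval


# delta: bump the initial value; vega: bump the volatility
delta = forward_difference(lambda s: bsm_call_value(s, 0.2), 36., 0.01)
vega = forward_difference(lambda v: bsm_call_value(36., v), 0.2, 0.0001)
print(round(delta, 4), round(vega, 4))
```

Nothing in forward_difference() depends on the exercise type, which is why the same approach applies to European and American valuation objects alike.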


Least-Squares Monte Carlo 


The LSM valuation algorithm of Longstaff and Schwartz (2001) is a 
numerically efficient algorithm to value options and even complex 
derivatives with American or Bermudan exercise features. The OLS 
regression step allows the approximation of the optimal exercise 
strategy based on an efficient numerical method. Since OLS regression 
can easily handle high-dimensional data, LSM is a flexible 
method in derivatives pricing. 


Conclusion 


This chapter is about the numerical valuation of European and American options 
based on Monte Carlo simulation. The chapter introduces a generic valuation class, 
called dx.valuation_class. This class provides methods, for example, to estimate 
the most important Greeks (delta, vega) for both types of options, independent of the 
simulation object (i.e., the risk factor or stochastic process) used for the valuation. 


Based on the generic valuation class, the chapter presents two specialized classes, 
dx.valuation_mcs_european and dx.valuation_mcs_american. The class for the 
valuation of European options is mainly a straightforward implementation of the 
risk-neutral valuation approach presented in Chapter 17 in combination with the 
numerical estimation of an expectation term (i.e., an integral by Monte Carlo simula- 
tion, as discussed in Chapter 11). 


The class for the valuation of American options needs a certain kind of regression- 
based valuation algorithm, called Least-Squares Monte Carlo (LSM). This is due to 
the fact that for American options an optimal exercise policy has to be derived for a 
valuation. This is theoretically and numerically a bit more involved. However, the 
respective present_value() method of the class is still concise. 


The approach taken with the DX derivatives analytics package proves to be beneficial. 
Without too much effort one is able to value a relatively large class of options with 
the following features: 


• Single risk factor 
• European or American exercise 
• Arbitrary payoff 


In addition, one can estimate the most important Greeks for this class of options. To 
simplify future imports, again a wrapper module is used, this time called 
dx_valuation.py: 


#
# DX Package
#
# Valuation Classes
#
# dx_valuation.py
#
# Python for Finance, 2nd ed.
# (c) Dr. Yves J. Hilpisch
#


import numpy as np 
import pandas as pd 


from dx_simulation import * 

from valuation_class import valuation_class 

from valuation_mcs_european import valuation_mcs_european 
from valuation_mcs_american import valuation_mcs_american 


The __init__.py file in the dx folder is updated accordingly: 


# 

# DX Package 

# packaging file 

# __init__.py 

# 

import numpy as np 
import pandas as pd 
import datetime as dt 


# frame 

from get_year_deltas import get_year_deltas 

from constant_short_rate import constant_short_rate 
from market_environment import market_environment 
from plot_option_stats import plot_option_stats 


# simulation 
from sn_random_numbers import sn_random_numbers 
from simulation_class import simulation_class 
from geometric_brownian_motion import geometric_brownian_motion 
from jump_diffusion import jump_diffusion 
from square_root_diffusion import square_root_diffusion 


# valuation 

from valuation_class import valuation_class 

from valuation_mcs_european import valuation_mcs_european 
from valuation_mcs_american import valuation_mcs_american 


Further Resources 


References for the topics of this chapter in book form are: 


• Glasserman, Paul (2004). Monte Carlo Methods in Financial Engineering. New 
York: Springer. 

• Hilpisch, Yves (2015). Derivatives Analytics with Python. Chichester, England: 
Wiley Finance. 


Original papers cited in this chapter are: 


• Cox, John, Stephen Ross, and Mark Rubinstein (1979). “Option Pricing: A 
Simplified Approach.” Journal of Financial Economics, Vol. 7, No. 3, pp. 229-263. 

• Kohler, Michael (2010). “A Review on Regression-Based Monte Carlo Methods 
for Pricing American Options.” In Luc Devroye et al. (eds.): Recent Developments 
in Applied Probability and Statistics (pp. 37-58). Heidelberg: Physica-Verlag. 

• Longstaff, Francis, and Eduardo Schwartz (2001). “Valuing American Options by 
Simulation: A Simple Least Squares Approach.” Review of Financial Studies, Vol. 
14, No. 1, pp. 113-147. 


CHAPTER 20 
Portfolio Valuation 


Price is what you pay. Value is what you get. 


—Warren Buffett 


By now, the whole approach for building the DX derivatives analytics package—and 
its associated benefits—should be clear. By strictly relying on Monte Carlo simulation 
as the only numerical method, the approach accomplishes an almost complete modu- 
larization of the analytics package: 


Discounting 
The relevant risk-neutral discounting is taken care of by an instance of the 
dx.constant_short_rate class. 


Relevant data 
Relevant data, parameters, and other input are stored in (several) instances of the 
dx.market_environment class. 


Simulation objects 
Relevant risk factors (underlyings) are modeled as instances of one of three sim- 
ulation classes: 


• dx.geometric_brownian_motion 
• dx.jump_diffusion 
• dx.square_root_diffusion 


Valuation objects 
Options and derivatives to be valued are modeled as instances of one of two 
valuation classes: 


• dx.valuation_mcs_european 
• dx.valuation_mcs_american 


One last step is missing: the valuation of possibly complex portfolios of options and 
derivatives. To this end, the following requirements shall be satisfied: 


Nonredundancy 
Every risk factor (underlying) is modeled only once and potentially used by mul- 
tiple valuation objects. 


Correlations 
Correlations between risk factors have to be accounted for. 


Positions 
An option position, for example, consists of a certain number of option 
contracts. 


However, although it is in principle allowed (it is in fact even required) to provide a 
currency for both simulation and valuation objects, the following code assumes that 
portfolios are denominated in a single currency only. This simplifies the aggregation 
of values within a portfolio significantly, because one can abstract from exchange 
rates and currency risks. 
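The correlation requirement is commonly met by transforming independent standard normal random numbers with the Cholesky factor of the correlation matrix. The following sketch shows the idea in isolation; the correlation values, the array shapes, and all variable names are assumptions for illustration and not taken from the DX package:

```python
import numpy as np

# Hypothetical correlation matrix for three risk factors (assumed values).
correlation_matrix = np.array([[1.0, 0.7, -0.2],
                               [0.7, 1.0, 0.0],
                               [-0.2, 0.0, 1.0]])

# Lower triangular factor L with L @ L.T equal to the correlation matrix.
cholesky_matrix = np.linalg.cholesky(correlation_matrix)

# Independent standard normals: (risk factor, time step, path).
rng = np.random.default_rng(1000)
n_steps, n_paths = 12, 50000
independent = rng.standard_normal((3, n_steps, n_paths))

# Mix the independent numbers across risk factors for every step and path.
correlated = np.einsum('ij,jtp->itp', cholesky_matrix, independent)

# The sample correlations should be close to the target matrix.
sample_corr = np.corrcoef(correlated.reshape(3, -1))
print(sample_corr.round(3))
```

Each risk factor is then simulated from its own slice of the correlated array, so that every underlying is modeled only once while the dependence structure is preserved.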


The chapter presents two new classes: a simple one to model a derivatives position, 
and a more complex one to model and value a derivatives portfolio. It is structured as 
follows: 


“Derivatives Positions” 
This section introduces the class to model a single derivatives position. 


“Derivatives Portfolios” 
This section introduces the core class to value a portfolio of potentially many 
derivatives positions. 


Derivatives Positions 


In principle, a derivatives position is nothing more than a combination of a valuation 
object and a quantity for the instrument modeled. 


The Class 


The code that follows presents the class to model a derivatives position. It is mainly a 
container for data and objects. In addition, it provides a get_info() method, 
printing the data and object information stored in an instance of the class: 


#
# DX Package
#
# Portfolio -- Derivatives Position Class
#
# derivatives_position.py
#
# Python for Finance, 2nd ed.
# (c) Dr. Yves J. Hilpisch
#


class derivatives_position(object):
    ''' Class to model a derivatives position.

    Attributes
    ==========
    name: str
        name of the object
    quantity: float
        number of assets/derivatives making up the position
    underlying: str
        name of asset/risk factor for the derivative
    mar_env: instance of market_environment
        constants, lists, and curves relevant for valuation_class
    otype: str
        valuation class to use
    payoff_func: str
        payoff string for the derivative

    Methods
    =======
    get_info:
        prints information about the derivatives position
    '''

    def __init__(self, name, quantity, underlying, mar_env,
                 otype, payoff_func):
        self.name = name
        self.quantity = quantity
        self.underlying = underlying
        self.mar_env = mar_env
        self.otype = otype
        self.payoff_func = payoff_func

    def get_info(self):
        print('NAME')
        print(self.name, '\n')
        print('QUANTITY')
        print(self.quantity, '\n')
        print('UNDERLYING')
        print(self.underlying, '\n')
        print('MARKET ENVIRONMENT')
        print('\n**Constants**')
        for key, value in self.mar_env.constants.items():
            print(key, value)
        print('\n**Lists**')
        for key, value in self.mar_env.lists.items():
            print(key, value)
        print('\n**Curves**')
        # unpack key and value here as well; otherwise a stale value
        # from a previous loop would be printed
        for key, value in self.mar_env.curves.items():
            print(key, value)
        print('\nOPTION TYPE')
        print(self.otype, '\n')
        print('PAYOFF FUNCTION')
        print(self.payoff_func)


To define a derivatives position the following information is required, which is 
almost the same as for the instantiation of a valuation class: 


name 
Name of the position as a str object 


quantity 
Quantity of options/derivatives 


underlying 
Instance of simulation object as a risk factor 


mar_env 
Instance of dx.market_environment 


otype 
str, either "European" or "American" 


payoff_func 
Payoff as a Python str object 
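To make the container character concrete, the following sketch expresses the same idea with a dataclass. It is not the DX class: the names PositionSketch, ALLOWED_OTYPES, and value() are hypothetical, the market environment is left out to keep the example self-contained, and it is assumed that the value of a position is simply the quantity times the present value of a single contract:

```python
from dataclasses import dataclass

# exercise types named in the text
ALLOWED_OTYPES = ('European', 'American')


@dataclass
class PositionSketch:
    ''' Hypothetical stand-in for a derivatives position. '''
    name: str
    quantity: float
    underlying: str   # name of the risk factor, not the object itself
    otype: str
    payoff_func: str

    def __post_init__(self):
        # reject exercise types for which no valuation class exists
        if self.otype not in ALLOWED_OTYPES:
            raise ValueError('otype must be one of %s' % (ALLOWED_OTYPES,))

    def value(self, unit_value):
        # position value = number of contracts times value per contract
        return self.quantity * unit_value


pos = PositionSketch(name='am_put_pos', quantity=3, underlying='gbm',
                     otype='American',
                     payoff_func='np.maximum(strike - instrument_values, 0)')
position_value = pos.value(4.472834)  # unit value taken from Out[94]
print(position_value)
```

Referring to the underlying by name rather than by object is what later allows a portfolio to model every risk factor only once.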


A Use Case 


The following interactive session illustrates the use of the class. However, first a defi- 
nition of a simulation object is needed (but not in full; only the most important, 
object-specific information is required): 


In [99]: from dx_valuation import * 


In [100]: me_gbm = market_environment('me_gbm', dt.datetime(2020, 1, 1))  # (1) 


In [101]: me_gbm.add_constant('initial_value', 36.)  # (1) 
          me_gbm.add_constant('volatility', 0.2)  # (1) 
          me_gbm.add_constant('currency', 'EUR')  # (1) 


In [102]: me_gbm.add_constant('model', 'gbm')  # (2) 


(1) The dx.market_environment object for the underlying. 


(2) The model type needs to be specified here. 


Similarly, for the definition of the derivatives position, one does not need a “com- 
plete” dx.market_environment object. Missing information is provided later (during 
the portfolio valuation), when the simulation object is instantiated: 


In [103]: from derivatives_position import derivatives_position 


In [104]: me_am_put = market_environment('me_am_put', dt.datetime(2020, 1, 1))  # (1) 


In [105]: me_am_put.add_constant('maturity', dt.datetime(2020, 12, 31))  # (1) 
          me_am_put.add_constant('strike', 40.)  # (1) 
          me_am_put.add_constant('currency', 'EUR')  # (1) 


In [106]: payoff_func = 'np.maximum(strike - instrument_values, 0)'  # (2) 


In [107]: am_put_pos = derivatives_position( 
                       name='am_put_pos', 
                       quantity=3, 
                       underlying='gbm', 
                       mar_env=me_am_put, 
                       otype='American', 
                       payoff_func=payoff_func)  # (3) 


In [108]: am_put_pos.get_info() 
          NAME 
          am_put_pos 

          QUANTITY 
          3 

          UNDERLYING 
          gbm 

          MARKET ENVIRONMENT 

          **Constants** 
          maturity 2020-12-31 00:00:00 
          strike 40.0 
          currency EUR 

          **Lists** 

          **Curves** 

          OPTION TYPE 
          American 

          PAYOFF FUNCTION 
          np.maximum(strike - instrument_values, 0) 


(1) The dx.market_environment object for the derivative. 

(2) The payoff function of the derivative. 

(3) The instantiation of the derivatives_position object. 


Derivatives Portfolios 


From a portfolio perspective, a relevant market is mainly composed of the relevant 
risk factors (underlyings) and their correlations, as well as the derivatives and 
derivatives positions, respectively, to be valued. Theoretically, the analysis to follow now 
deals with a general market model ℳ as defined in Chapter 17, and applies the 
Fundamental Theorem of Asset Pricing (with its corollaries) to it.¹ 


The Class 


A somewhat complex Python class implementing a portfolio valuation based on the 
Fundamental Theorem of Asset Pricing—taking into account multiple relevant risk 
factors and multiple derivatives positions—is presented next. The class is docu- 
mented inline, especially during passages that implement functionality specific to the 
purpose at hand: 


# 

# DX Package 

# 

# Portfolio -- Derivatives Portfolio Class 
# 

# derivatives_portfolio.py 

# 

# Python for Finance, 2nd ed. 

# (c) Dr. Yves J. Hilpisch 

# 


1 In practice, the approach chosen here is sometimes called global valuation instead of instrument-specific valu- 
ation. See Albanese, Gimonet, and White (2010a). 
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import numpy as np 
import pandas as pd 


from dx_valuation import * 


# models available for risk factor modeling 
models = {'gbm': geometric_brownian_motion, 
'jd': jump_diffusion, 
'srd': square_root_diffusion} 


# allowed exercise types 


otypes = {'European': valuation_mcs_european, 
"American': valuation_mcs_american} 


class derivatives_portfolio(object):
    ''' Class for modeling and valuing portfolios of derivatives positions.

    Attributes

    name: str
        name of the object
    positions: dict
        dictionary of positions (instances of derivatives_position class)
    val_env: market_environment
        market environment for the valuation
    assets: dict
        dictionary of market environments for the assets
    correlations: list
        correlations between assets
    fixed_seed: bool
        flag for fixed random number generator seed

    Methods

    get_positions:
        prints information about the single portfolio positions
    get_statistics:
        returns a pandas DataFrame object with portfolio statistics
    '''
    def __init__(self, name, positions, val_env, assets,
                 correlations=None, fixed_seed=False):
        self.name = name
        self.positions = positions
        self.val_env = val_env
        self.assets = assets
        self.underlyings = set()
        self.correlations = correlations
        self.time_grid = None
        self.underlying_objects = {}
        self.valuation_objects = {}
        self.fixed_seed = fixed_seed
        self.special_dates = []
        for pos in self.positions:
            # determine earliest starting_date
            self.val_env.constants['starting_date'] = \
                min(self.val_env.constants['starting_date'],
                    positions[pos].mar_env.pricing_date)
            # determine latest date of relevance
            self.val_env.constants['final_date'] = \
                max(self.val_env.constants['final_date'],
                    positions[pos].mar_env.constants['maturity'])
            # collect all underlyings and
            # add to set (avoids redundancy)
            self.underlyings.add(positions[pos].underlying)

        # generate general time grid
        start = self.val_env.constants['starting_date']
        end = self.val_env.constants['final_date']
        time_grid = pd.date_range(start=start, end=end,
                                  freq=self.val_env.constants['frequency']
                                  ).to_pydatetime()
        time_grid = list(time_grid)
        for pos in self.positions:
            maturity_date = positions[pos].mar_env.constants['maturity']
            if maturity_date not in time_grid:
                time_grid.insert(0, maturity_date)
                self.special_dates.append(maturity_date)
        if start not in time_grid:
            time_grid.insert(0, start)
        if end not in time_grid:
            time_grid.append(end)
        # delete duplicate entries
        time_grid = list(set(time_grid))
        # sort dates in time_grid
        time_grid.sort()
        self.time_grid = np.array(time_grid)
        self.val_env.add_list('time_grid', self.time_grid)


        if correlations is not None:
            # take care of correlations
            ul_list = sorted(self.underlyings)
            correlation_matrix = np.zeros((len(ul_list), len(ul_list)))
            np.fill_diagonal(correlation_matrix, 1.0)
            correlation_matrix = pd.DataFrame(correlation_matrix,
                                              index=ul_list, columns=ul_list)
            for i, j, corr in correlations:
                corr = min(corr, 0.999999999999)
                # fill correlation matrix
                correlation_matrix.loc[i, j] = corr
                correlation_matrix.loc[j, i] = corr
            # determine Cholesky matrix
            cholesky_matrix = np.linalg.cholesky(np.array(correlation_matrix))

            # dictionary with index positions for the
            # slice of the random number array to be used by
            # respective underlying
            rn_set = {asset: ul_list.index(asset)
                      for asset in self.underlyings}

            # random numbers array, to be used by
            # all underlyings (if correlations exist)
            random_numbers = sn_random_numbers((len(rn_set),
                                                len(self.time_grid),
                                                self.val_env.constants['paths']),
                                               fixed_seed=self.fixed_seed)

            # add all to valuation environment that is
            # to be shared with every underlying
            self.val_env.add_list('cholesky_matrix', cholesky_matrix)
            self.val_env.add_list('random_numbers', random_numbers)
            self.val_env.add_list('rn_set', rn_set)


        for asset in self.underlyings:
            # select market environment of asset
            mar_env = self.assets[asset]
            # add valuation environment to market environment
            mar_env.add_environment(val_env)
            # select right simulation class
            model = models[mar_env.constants['model']]
            # instantiate simulation object
            if correlations is not None:
                self.underlying_objects[asset] = model(asset, mar_env,
                                                       corr=True)
            else:
                self.underlying_objects[asset] = model(asset, mar_env,
                                                       corr=False)

        for pos in positions:
            # select right valuation class (European, American)
            val_class = otypes[positions[pos].otype]
            # pick market environment and add valuation environment
            mar_env = positions[pos].mar_env
            mar_env.add_environment(self.val_env)
            # instantiate valuation class
            self.valuation_objects[pos] = \
                val_class(name=positions[pos].name,
                          mar_env=mar_env,
                          underlying=self.underlying_objects[
                              positions[pos].underlying],
                          payoff_func=positions[pos].payoff_func)


    def get_positions(self):
        ''' Convenience method to get information about
        all derivatives positions in a portfolio. '''
        for pos in self.positions:
            bar = '\n' + 50 * '-'
            print(bar)
            self.positions[pos].get_info()
            print(bar)

    def get_statistics(self, fixed_seed=False):
        ''' Provides portfolio statistics. '''
        res_list = []
        # iterate over all positions in portfolio
        for pos, value in self.valuation_objects.items():
            p = self.positions[pos]
            pv = value.present_value(fixed_seed=fixed_seed)
            res_list.append([
                p.name,
                p.quantity,
                # calculate all present values for the single instruments
                pv,
                value.currency,
                # single instrument value times quantity
                pv * p.quantity,
                # calculate delta of position
                value.delta() * p.quantity,
                # calculate vega of position
                value.vega() * p.quantity,
            ])
        # generate a pandas DataFrame object with all results
        res_df = pd.DataFrame(res_list,
                              columns=['name', 'quant.', 'value', 'curr.',
                                       'pos_value', 'pos_delta', 'pos_vega'])
        return res_df
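

The correlation handling in __init__ can be studied in isolation. The following is a minimal sketch, not part of the DX package: it builds the correlation matrix for two underlyings, computes its Cholesky factor, and applies the factor to independent standard normal draws. The seed, the sample size, and the names independent, correlated, and sample_corr are assumptions made for illustration; how the simulation classes consume the shared cholesky_matrix and random_numbers objects is mirrored here only in its simplest form.

```python
import numpy as np

ul_list = ['gbm', 'jd']               # sorted names of the underlyings
correlations = [['gbm', 'jd', 0.9]]   # pairwise correlations, as passed to the class

# correlation matrix with ones on the diagonal
corr_matrix = np.eye(len(ul_list))
for i, j, corr in correlations:
    a, b = ul_list.index(i), ul_list.index(j)
    corr_matrix[a, b] = corr_matrix[b, a] = min(corr, 0.999999999999)

# lower-triangular Cholesky factor: cholesky_matrix @ cholesky_matrix.T
# reproduces corr_matrix
cholesky_matrix = np.linalg.cholesky(corr_matrix)

# independent standard normal draws, one row per underlying
rng = np.random.default_rng(100)      # arbitrary seed (assumption)
independent = rng.standard_normal((len(ul_list), 100000))

# correlated draws: every column is transformed by the Cholesky factor
correlated = cholesky_matrix @ independent

sample_corr = np.corrcoef(correlated)[0, 1]
print(round(sample_corr, 2))
```

The cap on the correlation value is what keeps the matrix strictly positive definite, so that the Cholesky decomposition does not fail for a correlation of exactly 1.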


Object Orientation 


The class dx.derivatives_portfolio illustrates a number of ben- 
efits of object orientation as mentioned in Chapter 6. At first 
inspection, it might look like a complex piece of Python code. 
However, the financial problem that it solves is a pretty complex 
one and it provides the flexibility to address a large number of dif- 
ferent use cases. It is hard to imagine how all this could be achieved 
without the use of object-oriented programming and Python 
classes. 
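

One piece of the class that is easy to examine on its own is the construction of the general time grid. The sketch below condenses the logic of __init__ under assumed inputs: the dates are hypothetical, and the set-based merge replaces the explicit insert calls of the class while leading to the same kind of result (a sorted grid without duplicates that contains the start date, the end date, and every maturity).

```python
import datetime as dt
import pandas as pd

start = dt.datetime(2020, 1, 1)     # hypothetical pricing date
end = dt.datetime(2020, 12, 31)     # hypothetical latest maturity
maturities = [dt.datetime(2020, 6, 30), dt.datetime(2020, 12, 31)]

# regular weekly grid between start and end
time_grid = list(pd.date_range(start=start, end=end, freq='W').to_pydatetime())

# add start, end, and all maturities; the set removes duplicates
time_grid = sorted(set(time_grid + [start, end] + maturities))

print(time_grid[0], time_grid[-1], len(time_grid))
```

Dates that are not part of the regular grid, such as the maturity in the middle of the year here, correspond to what the class collects in special_dates.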


A Use Case 


In terms of the DX analytics package, the modeling capabilities are, on a high level, 
restricted to a combination of a simulation and a valuation class. There are a total of 
six possible combinations: 

models = {'gbm': geometric_brownian_motion,
          'jd': jump_diffusion,
          'srd': square_root_diffusion}

otypes = {'European': valuation_mcs_european,
          'American': valuation_mcs_american}


The interactive use case that follows combines selected elements to define two differ- 
ent derivatives positions that are then combined into a portfolio. 
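

The count of six follows from pairing each of the three simulation models with each of the two exercise types. A small sketch makes the enumeration explicit; it uses only the dictionary keys as plain strings, so the DX classes themselves are not needed (the list names are chosen here for illustration).

```python
from itertools import product

# keys only: the strings stand in for the simulation and valuation
# classes of the DX package, which are not imported here
model_names = ['gbm', 'jd', 'srd']
exercise_types = ['European', 'American']

combinations = list(product(model_names, exercise_types))
for model, otype in combinations:
    print(f'{model:>3} + {otype}')

print(len(combinations))
```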


Recall the derivatives_position class with the gbm and am_put_pos objects from 
the previous section. To illustrate the use of the derivatives_portfolio class, we'll 
define both an additional underlying and an additional options position. First, a 
dx.jump_diffusion object: 


In [109]: me_jd = market_environment('me_jd', me_gbm.pricing_date)

In [110]: me_jd.add_constant('lambda', 0.3)  (1)
          me_jd.add_constant('mu', -0.75)
          me_jd.add_constant('delta', 0.1)
          me_jd.add_environment(me_gbm)  (2)

In [111]: me_jd.add_constant('model', 'jd')  (3)


(1) Adds jump diffusion-specific parameters. 

(2) Adds other parameters from gbm. 

(3) Needed for portfolio valuation. 


Second, a European call option based on this new simulation object: 


In [112]: me_eur_call = market_environment('me_eur_call', me_jd.pricing_date)

In [113]: me_eur_call.add_constant('maturity', dt.datetime(2020, 6, 30))
          me_eur_call.add_constant('strike', 38.)
          me_eur_call.add_constant('currency', 'EUR')

In [114]: payoff_func = 'np.maximum(maturity_value - strike, 0)'

In [115]: eur_call_pos = derivatives_position(
              name='eur_call_pos',
              quantity=5,
              underlying='jd',
              mar_env=me_eur_call,
              otype='European',
              payoff_func=payoff_func)


From a portfolio perspective, the relevant market is now described by the underlyings 
and positions objects defined next. For the moment, the definitions do not include 
correlations between the underlyings. Compiling a dx.market_environment for the 
portfolio valuation is the last step before the instantiation of a derivatives_portfolio 
object: 


In [116]: underlyings = {'gbm': me_gbm, 'jd': me_jd}  (1)
          positions = {'am_put_pos': am_put_pos,
                       'eur_call_pos': eur_call_pos}  (2)

In [117]: csr = constant_short_rate('csr', 0.06)  (3)

In [118]: val_env = market_environment('general', me_gbm.pricing_date)
          val_env.add_constant('frequency', 'W')
          val_env.add_constant('paths', 25000)
          val_env.add_constant('starting_date', val_env.pricing_date)
          val_env.add_constant('final_date', val_env.pricing_date)  (4)
          val_env.add_curve('discount_curve', csr)  (3)

In [119]: from derivatives_portfolio import derivatives_portfolio

In [120]: portfolio = derivatives_portfolio(
              name='portfolio',
              positions=positions,
              val_env=val_env,
              assets=underlyings,
              fixed_seed=False)  (5)


(1) Relevant risk factors. 

(2) Relevant portfolio positions. 

(3) Unique discounting object for the portfolio valuation. 

(4) final_date is not yet known; therefore, set pricing_date as preliminary value. 

(5) Instantiation of the derivatives_portfolio object. 


Now one can harness the power of the valuation class and easily get important statis- 
tics for the derivatives_portfolio object just defined. The sum of the position val- 
ues, deltas, and vegas is also easily calculated. This portfolio is slightly long delta 
(almost neutral) and long vega: 


In [121]: %time portfolio.get_statistics(fixed_seed=False)
          CPU times: user 4.68 s, sys: 409 ms, total: 5.09 s
          Wall time: 14.5 s

Out[121]:
                    name  quant.     value curr.  pos_value  pos_delta  pos_vega
          0    am_put_pos       3  4.458891   EUR  13.376673    -2.0430   31.7850
          1  eur_call_pos       5  2.828634   EUR  14.143170     3.2525   42.2655

In [122]: portfolio.get_statistics(fixed_seed=False)[
              ['pos_value', 'pos_delta', 'pos_vega']].sum()  (1)
Out[122]: pos_value    27.902731
          pos_delta     1.233500
          pos_vega     74.050500
          dtype: float64

In [123]: portfolio.get_positions()  (2)

In [124]: portfolio.valuation_objects['am_put_pos'].present_value()  (3)
Out[124]: 4.453187

In [125]: portfolio.valuation_objects['eur_call_pos'].delta()  (4)
Out[125]: 0.6514


(1) Aggregation of single position values. 

(2) This method call would create a rather lengthy output about all positions. 

(3) The present value estimate for a single position. 

(4) The delta estimate for a single position. 


The derivatives portfolio valuation is conducted based on the assumption that the 
risk factors are not correlated. This is easily verified by inspecting two simulated 
paths (see Figure 20-1), one for each simulation object: 


In [126]: path_no = 888
          path_gbm = portfolio.underlying_objects[
              'gbm'].get_instrument_values()[:, path_no]
          path_jd = portfolio.underlying_objects[
              'jd'].get_instrument_values()[:, path_no]

In [127]: plt.figure(figsize=(10, 6))
          plt.plot(portfolio.time_grid, path_gbm, 'r', label='gbm')
          plt.plot(portfolio.time_grid, path_jd, 'b', label='jd')
          plt.xticks(rotation=30)
          plt.legend(loc=0)


Figure 20-1. Noncorrelated risk factors (two sample paths) 


Now consider the case where the two risk factors are highly positively correlated. In 
this case, there is no direct influence on the values of the single positions in the 
portfolio: 


In [128]: correlations = [['gbm', 'jd', 0.9]] 


In [129]: port_corr = derivatives_portfolio(
              name='portfolio',
              positions=positions,
              val_env=val_env,
              assets=underlyings,
              correlations=correlations,
              fixed_seed=True)

In [130]: port_corr.get_statistics()
Out[130]:
                    name  quant.     value curr.  pos_value  pos_delta  pos_vega
          0    am_put_pos       3  4.458556   EUR  13.375668    -2.0376   30.8676
          1  eur_call_pos       5  2.817813   EUR  14.089065     3.3375   42.2340


However, the correlation takes place behind the scenes. The graphical illustration in 
Figure 20-2 takes the same combination of paths as before. The two paths now almost 
move in parallel: 


In [131]: path_gbm = port_corr.underlying_objects['gbm'].\
                      get_instrument_values()[:, path_no]
          path_jd = port_corr.underlying_objects['jd'].\
                      get_instrument_values()[:, path_no]

In [132]: plt.figure(figsize=(10, 6))
          plt.plot(portfolio.time_grid, path_gbm, 'r', label='gbm')
          plt.plot(portfolio.time_grid, path_jd, 'b', label='jd')
          plt.xticks(rotation=30)
          plt.legend(loc=0);


Figure 20-2. Correlated risk factors (two sample paths) 


As a last numerical and conceptual example, consider the frequency distribution of the 
portfolio present value. This is something impossible to generate in general with other 
approaches, like the application of analytical formulae or the binomial option pricing 
model. Setting the parameter full=True causes the complete set of present values per 
option position to be returned after the present value estimation: 


In [133]: pv1 = 5 * port_corr.valuation_objects['eur_call_pos'].\
                      present_value(full=True)[1]
          pv1
Out[133]: array([ 0.        , 39.71423714, 24.90720272, ...,  0.        ,
                  6.42619093,  8.15838265])

In [134]: pv2 = 3 * port_corr.valuation_objects['am_put_pos'].\
                      present_value(full=True)[1]
          pv2
Out[134]: array([21.31806027, 10.71952869, 19.89804376, ..., 21.39292703,
                 17.59920608,  0.        ])


First, compare the frequency distribution of the two positions. The payoff profiles of 
the two positions, as displayed in Figure 20-3, are quite different. Note that the values 
for both the x- and y-axes are limited for better readability: 


In [135]: plt.figure(figsize=(10, 6))
          plt.hist([pv1, pv2], bins=25,
                   label=['European call', 'American put']);
          plt.axvline(pv1.mean(), color='r', ls='dashed',
                      lw=1.5, label='call mean = %4.2f' % pv1.mean())
          plt.axvline(pv2.mean(), color='r', ls='dotted',
                      lw=1.5, label='put mean = %4.2f' % pv2.mean())
          plt.xlim(0, 80); plt.ylim(0, 10000)
          plt.legend();


Figure 20-3. Frequency distribution of present values of the two positions 


Figure 20-4 finally shows the full frequency distribution of the portfolio present val- 
ues. One can clearly see the offsetting diversification effects of combining a call with a 
put option: 


In [136]: pvs = pv1 + pv2
          plt.figure(figsize=(10, 6))
          plt.hist(pvs, bins=50, label='portfolio');
          plt.axvline(pvs.mean(), color='r', ls='dashed',
                      lw=1.5, label='mean = %4.2f' % pvs.mean())
          plt.xlim(0, 80); plt.ylim(0, 7000)
          plt.legend();


Figure 20-4. Portfolio frequency distribution of present values 


What impact does the correlation between the two risk factors have on the risk of the 
portfolio, measured in the standard deviation of the present values? This can be 
answered by the following two estimations: 


In [137]: pvs.std()  (1)
Out[137]: 16.723724772741118

In [138]: pv1 = (5 * portfolio.valuation_objects['eur_call_pos'].
                 present_value(full=True)[1])
          pv2 = (3 * portfolio.valuation_objects['am_put_pos'].
                 present_value(full=True)[1])
          (pv1 + pv2).std()  (2)
Out[138]: 21.80498672323975


(1) Standard deviation of portfolio values with correlation. 

(2) Standard deviation of portfolio values without correlation. 
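

The direction of this effect can be reproduced in a stripped-down setting that does not rely on the DX classes. The sketch below is an illustration under simplifying assumptions, not the valuation performed above: both assets are plain driftless lognormal variables, the options are European payoffs at a single date, there is no discounting, and all parameter values as well as the function name portfolio_std are hypothetical.

```python
import numpy as np

def portfolio_std(rho, paths=200000, seed=1000):
    ''' Standard deviation of a long call on asset 1 plus a long put
    on asset 2, given the correlation rho of the driving normals. '''
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal(paths)
    # second normal with correlation rho to the first one
    z2 = rho * z1 + np.sqrt(1 - rho ** 2) * rng.standard_normal(paths)
    # terminal values of two driftless lognormal assets (illustrative numbers)
    S0, K, sigma, T = 36.0, 36.0, 0.2, 1.0
    s1 = S0 * np.exp(-0.5 * sigma ** 2 * T + sigma * np.sqrt(T) * z1)
    s2 = S0 * np.exp(-0.5 * sigma ** 2 * T + sigma * np.sqrt(T) * z2)
    payoff = np.maximum(s1 - K, 0) + np.maximum(K - s2, 0)
    return payoff.std()

std_uncorr = portfolio_std(0.0)
std_corr = portfolio_std(0.9)
print(std_corr < std_uncorr)
```

With positively correlated underlyings, paths on which the call pays off tend to be paths on which the put expires worthless, and vice versa. The two payoffs are therefore negatively correlated, which lowers the dispersion of their sum.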


Although the mean value stays constant (ignoring numerical deviations), correlation 
markedly decreases the portfolio risk when measured in this way: the standard 
deviation drops from about 21.8 without correlation to about 16.7 with correlation. 
Again, this is an insight that is hard to gain with alternative numerical methods or 
valuation approaches. 


Conclusion 


This chapter addresses the valuation and risk management of a portfolio of multiple 
derivatives positions dependent on multiple (possibly correlated) risk factors. To this 
end, a new class called derivatives_position is introduced to model an options or 
derivatives position. The main focus, however, lies on the derivatives_portfolio 
class, which implements some more complex tasks. For example, the class takes care 
of: 


• Correlations between risk factors (the class generates a single consistent set of 
  random numbers for the simulation of all risk factors) 

• Instantiation of simulation objects given the single market environments and the 
  general valuation environment, as well as the derivatives positions 

• Generation of portfolio statistics based on all the assumptions, the risk factors 
  involved, and the terms of the derivatives positions 


The examples presented in this chapter can only show some simple versions of deriv- 
atives portfolios that can be managed and valued with the DX package developed so 
far and the derivatives_portfolio class. Natural extensions to the DX package 
would be the addition of more sophisticated financial models, like a stochastic vola- 
tility model, and multi-risk valuation classes to model and value derivatives depen- 
dent on multiple risk factors (like a European basket option or an American 
maximum call option, to name just two). At this stage, the modular modeling using 
OOP and the application of a valuation framework as general as the Fundamental 
Theorem of Asset Pricing (or “global valuation”) play out their strengths: the non- 
redundant modeling of the risk factors and the accounting for the correlations 
between them will then also have a direct influence on the values and Greeks of 
multi-risk derivatives. 


The following is a final wrapper module bringing all the components of the DX analyt- 
ics package together for a single import statement: 


#
# DX Package
#
# All components
#
# dx_package.py
#
# Python for Finance, 2nd ed.
# (c) Dr. Yves J. Hilpisch
#
from dx_valuation import *
from derivatives_position import derivatives_position
from derivatives_portfolio import derivatives_portfolio


And here is the now-complete __init__.py file for the dx folder: 


#
# DX Package
# packaging file
# __init__.py
#
import numpy as np
import pandas as pd
import datetime as dt

# frame
from get_year_deltas import get_year_deltas
from constant_short_rate import constant_short_rate
from market_environment import market_environment
from plot_option_stats import plot_option_stats

# simulation
from sn_random_numbers import sn_random_numbers
from simulation_class import simulation_class
from geometric_brownian_motion import geometric_brownian_motion
from jump_diffusion import jump_diffusion
from square_root_diffusion import square_root_diffusion

# valuation
from valuation_class import valuation_class
from valuation_mcs_european import valuation_mcs_european
from valuation_mcs_american import valuation_mcs_american

# portfolio
from derivatives_position import derivatives_position
from derivatives_portfolio import derivatives_portfolio


Further Resources 


As for the preceding chapters on the DX derivatives analytics package, Glasserman 
(2004) is a comprehensive resource for Monte Carlo simulation in the context of 
financial engineering and applications. Hilpisch (2015) also provides Python-based 
implementations of the most important Monte Carlo algorithms: 


• Glasserman, Paul (2004). Monte Carlo Methods in Financial Engineering. New 
  York: Springer. 

• Hilpisch, Yves (2015). Derivatives Analytics with Python. Chichester, England: 
  Wiley Finance. 


However, there is hardly any research available when it comes to the valuation of 
(complex) portfolios of derivatives in a consistent, nonredundant fashion by Monte 
Carlo simulation. A notable exception, at least from a conceptual point of view, is the 
brief article by Albanese, Gimonet, and White (2010a). There is a bit more detail in 
the working paper by the same team of authors: 


• Albanese, Claudio, Guillaume Gimonet and Steve White (2010a). “Towards a 
  Global Valuation Model”. Risk Magazine, Vol. 23, No. 5, pp. 68-71. 

• Albanese, Claudio, Guillaume Gimonet and Steve White (2010b). “Global Valuation 
  and Dynamic Risk Management”. Working paper. 


CHAPTER 21 
Market-Based Valuation 


We are facing extreme volatility. 


—Carlos Ghosn 


A major task in derivatives analytics is the market-based valuation of options and 
derivatives that are not liquidly traded. To this end, one generally calibrates a pricing 
model to market quotes of liquidly traded options and uses the calibrated model for 
the pricing of the non-traded options.¹ 


This chapter presents a case study based on the DX package and illustrates that this 
package, as developed step-by-step in the previous four chapters, is suited to imple- 
ment a market-based valuation. The case study is based on the DAX 30 stock index, 
which is a blue chip stock market index consisting of stocks of 30 major German 
companies. On this index, liquidly traded European call and put options are avail- 
able. 


The chapter is divided into sections that implement the following major tasks: 


“Options Data” on page 638 
One needs two types of data, namely for the DAX 30 stock index itself and for 
the liquidly traded European options on the index. 


“Model Calibration” on page 641 
To value the non-traded options in a market-consistent fashion, one generally 
first calibrates the chosen model to quoted option prices in such a way that the 
model based on the optimal parameters replicates the market prices as well as 
possible. 


“Portfolio Valuation” on page 651 
Equipped with the data and a market-calibrated model for the DAX 30 stock 
index, the final task then is to model and value the non-traded options; important 
risk measures are also estimated on a position and portfolio level. 


The index and options data used in this chapter are from the Thomson Reuters Eikon 
Data API (see “Python Code” on page 654). 


1 For details, refer to Hilpisch (2015). 


Options Data 


To get started, here are the required imports and customizations: 


In [1]: import numpy as np
        import pandas as pd
        import datetime as dt

In [2]: from pylab import mpl, plt
        plt.style.use('seaborn')
        mpl.rcParams['font.family'] = 'serif'
        %matplotlib inline

In [3]: import sys
        sys.path.append('../')
        sys.path.append('../dx')


Given the data file as created in “Python Code” on page 654, the options data is read 
with pandas and processed such that date information is given as pd.Timestamp 
objects: 


In [4]: dax = pd.read_csv('../../source/tr_eikon_option_data.csv',
                          index_col=0)  (1)

In [5]: for col in ['CF_DATE', 'EXPIR_DATE']:
            dax[col] = dax[col].apply(lambda date: pd.Timestamp(date))  (2)

In [6]: dax.info()  (3)
        <class 'pandas.core.frame.DataFrame'>
        Int64Index: 115 entries, 0 to 114
        Data columns (total 7 columns):
        Instrument    115 non-null object
        CF_DATE       115 non-null datetime64[ns]
        EXPIR_DATE    114 non-null datetime64[ns]
        PUTCALLIND    114 non-null object
        STRIKE_PRC    114 non-null float64
        CF_CLOSE      115 non-null float64
        IMP_VOLT      114 non-null float64
        dtypes: datetime64[ns](2), float64(3), object(2)
        memory usage: 7.2+ KB

In [7]: dax.set_index('Instrument').head(7)  (3)
Out[7]:
                           CF_DATE EXPIR_DATE PUTCALLIND  STRIKE_PRC  CF_CLOSE \
        Instrument
        .GDAXI          2018-04-27        NaT        NaN         NaN    12500.
        GDAX105000G8.EX 2018-04-27 2018-07-20       CALL     10500.0     2040.
        GDAX105000S8.EX 2018-04-27 2018-07-20        PUT     10500.0       32.
        GDAX108000G8.EX 2018-04-27 2018-07-20       CALL     10800.0     1752.
        GDAX108000S8.EX 2018-04-26 2018-07-20        PUT     10800.0       43.
        GDAX110000G8.EX 2018-04-27 2018-07-20       CALL     11000.0     1562.
        GDAX110000S8.EX 2018-04-27 2018-07-20        PUT     11000.0       54.

                         IMP_VOLT
        Instrument
        .GDAXI                NaN
        GDAX105000G8.EX     23.59
        GDAX105000S8.EX     23.59
        GDAX108000G8.EX     22.02
        GDAX108000S8.EX     22.02
        GDAX110000G8.EX     21.00
        GDAX110000S8.EX     21.00


(1) Reads the data with pd.read_csv(). 

(2) Processes the two columns with date information. 

(3) The resulting DataFrame object. 


The following code stores the relevant index level for the DAX 30 in a variable and 
creates two new DataFrame objects, one for calls and one for puts. Figure 21-1 
presents the market quotes for the calls and their implied volatilities: 


In [8]: initial_value = dax.iloc[0]['CF_CLOSE']  (1)

In [9]: calls = dax[dax['PUTCALLIND'] == 'CALL'].copy()  (2)
        puts = dax[dax['PUTCALLIND'] == 'PUT '].copy()  (2)

In [10]: calls.set_index('STRIKE_PRC')[['CF_CLOSE', 'IMP_VOLT']].plot(
             secondary_y='IMP_VOLT', style=['bo', 'rv'], figsize=(10, 6));


(1) Assigns the relevant index level to the initial_value variable. 

(2) Separates the options data for calls and puts into two new DataFrame objects. 


Figure 21-1. Market quotes and implied volatilities for European call options on the 
DAX 30 


2 The implied volatility of an option is the volatility value that gives, ceteris paribus, when put into the 
Black-Scholes-Merton (1973) option pricing formula, the market quote of the option. 
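

The data preparation steps can be tried out on a small synthetic data set. The sketch below is an illustration only: the rows and instrument names are made up, pd.to_datetime() is used as a vectorized alternative to applying pd.Timestamp row by row, and the whitespace stripping is an assumption motivated by the padded 'PUT ' literal in the listing above.

```python
import pandas as pd

# hypothetical rows in the layout of the options data set
raw = pd.DataFrame({
    'Instrument': ['INDEX', 'OPT_A', 'OPT_B'],
    'CF_DATE': ['2018-04-27', '2018-04-27', '2018-04-27'],
    'EXPIR_DATE': [None, '2018-07-20', '2018-07-20'],
    'PUTCALLIND': [None, 'CALL', 'PUT '],
    'STRIKE_PRC': [None, 10500.0, 10500.0],
})

# convert both date columns; missing entries become NaT
for col in ['CF_DATE', 'EXPIR_DATE']:
    raw[col] = pd.to_datetime(raw[col])

# strip stray whitespace before comparing the put/call indicator
indicator = raw['PUTCALLIND'].str.strip()
calls = raw[indicator == 'CALL'].copy()
puts = raw[indicator == 'PUT'].copy()
print(len(calls), len(puts))
```

The index row carries no expiry date or put/call indicator, which is why it drops out of both selections.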


Figure 21-2 presents the market quotes for the puts and their implied volatilities: 


In [11]: ax = puts.set_index('STRIKE_PRC')[['CF_CLOSE', 'IMP_VOLT']].plot(
             secondary_y='IMP_VOLT', style=['bo', 'rv'], figsize=(10, 6))
         ax.get_legend().set_bbox_to_anchor((0.25, 0.5));


Figure 21-2. Market quotes and implied volatilities for European put options on the 
DAX 30 
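

The implied volatilities plotted in both figures come with the data set. To make the concept concrete, the following sketch inverts the Black-Scholes-Merton formula for a European call without dividends by bisection. It uses only the standard library; the function names, the bracketing interval, and the numerical inputs are assumptions for illustration and are not related to how the data provider computes the IMP_VOLT column.

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    ''' Standard normal cumulative distribution function. '''
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bsm_call_value(S0, K, T, r, sigma):
    ''' Black-Scholes-Merton value of a European call (no dividends). '''
    d1 = (log(S0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S0 * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

def implied_volatility(C0, S0, K, T, r, low=1e-4, high=5.0, tol=1e-8):
    ''' Bisection on sigma; relies on the call value increasing in sigma. '''
    for _ in range(200):
        mid = 0.5 * (low + high)
        if bsm_call_value(S0, K, T, r, mid) > C0:
            high = mid
        else:
            low = mid
        if high - low < tol:
            break
    return 0.5 * (low + high)

# round trip with illustrative numbers: price at 20% volatility, then invert
quote = bsm_call_value(S0=100.0, K=105.0, T=0.5, r=0.01, sigma=0.2)
imp_vol = implied_volatility(quote, 100.0, 105.0, 0.5, 0.01)
print(round(imp_vol, 4))
```

Bisection is slower than a Newton scheme based on vega, but it needs no derivative and cannot leave the bracketing interval.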


Model Calibration 


This section selects the relevant market data, models the European options on the 
DAX 30 index, and implements the calibration procedure itself. 


Relevant Market Data 


Model calibration generally takes place based on a smaller subset of the available 
option market quotes.³ To this end, the following code selects only those European 
call options whose strike price is relatively close to the current index level (see 
Figure 21-3). In other words, only those European call options are selected that are 
not too far in-the-money or out-of-the-money: 


In [12]: limit = 500  (1)

In [13]: option_selection = calls[abs(calls['STRIKE_PRC'] - initial_value)
                                  < limit].copy()  (2)

In [14]: option_selection.info()  (3)
         <class 'pandas.core.frame.DataFrame'>
         Int64Index: 20 entries, 43 to 81
         Data columns (total 7 columns):
         Instrument    20 non-null object
         CF_DATE       20 non-null datetime64[ns]
         EXPIR_DATE    20 non-null datetime64[ns]
         PUTCALLIND    20 non-null object
         STRIKE_PRC    20 non-null float64
         CF_CLOSE      20 non-null float64
         IMP_VOLT      20 non-null float64
         dtypes: datetime64[ns](2), float64(3), object(2)
         memory usage: 1.2+ KB

In [15]: option_selection.set_index('Instrument').tail()  (3)
Out[15]:
                            CF_DATE EXPIR_DATE PUTCALLIND  STRIKE_PRC  CF_CLOSE \
         Instrument
         GDAX128000G8.EX 2018-04-27 2018-07-20       CALL     12800.0     182.4
         GDAX128500G8.EX 2018-04-27 2018-07-20       CALL     12850.0     162.0
         GDAX129000G8.EX 2018-04-25 2018-07-20       CALL     12900.0     142.9
         GDAX129500G8.EX 2018-04-27 2018-07-20       CALL     12950.0     125.4
         GDAX130000G8.EX 2018-04-27 2018-07-20       CALL     13000.0     109.4

                          IMP_VOLT
         Instrument
         GDAX128000G8.EX     12.70
         GDAX128500G8.EX     12.52
         GDAX129000G8.EX     12.36
         GDAX129500G8.EX     12.21
         GDAX130000G8.EX     12.06

In [16]: option_selection.set_index('STRIKE_PRC')[['CF_CLOSE', 'IMP_VOLT']].plot(
             secondary_y='IMP_VOLT', style=['bo', 'rv'], figsize=(10, 6));


(1) Sets the limit value for the deviation of the strike price from the current index 
level (moneyness condition). 

(2) Selects, based on the limit value, the European call options to be included for 
the calibration. 

(3) The resulting DataFrame with the European call options for the calibration. 


Figure 21-3. European call options on the DAX 30 used for model calibration 


3 See Hilpisch (2015), Chapter 11, for more details. 
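

The moneyness condition is a plain Boolean filter on the absolute distance between strike and index level. A short sketch with a hypothetical strike ladder and a hypothetical index level (neither is taken from the data file) shows the mechanics:

```python
import numpy as np

initial_value = 12500.5                        # hypothetical index level
strikes = np.arange(11000.0, 14000.1, 50.0)    # hypothetical strike ladder
limit = 500

# keep only strikes closer than limit to the index level
selected = strikes[np.abs(strikes - initial_value) < limit]
print(selected.min(), selected.max(), len(selected))
```

Because the comparison is strict, a strike exactly limit points away from the index level would be excluded.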


Option Modeling 


Having the relevant market data defined, the DX package can now be used to model 
the European call options. The definition of the dx.market_environment object to 
model the DAX 30 index follows, along the lines of the examples in previous 
chapters: 


In [17]: import dx

In [18]: pricing_date = option_selection['CF_DATE'].max()  (1)

In [19]: me_dax = dx.market_environment('DAX30', pricing_date)  (2)

In [20]: maturity = pd.Timestamp(calls.iloc[0]['EXPIR_DATE'])  (3)

In [21]: me_dax.add_constant('initial_value', initial_value)  (4)
         me_dax.add_constant('final_date', maturity)  (4)
         me_dax.add_constant('currency', 'EUR')

In [22]: me_dax.add_constant('frequency', 'B')  (5)
         me_dax.add_constant('paths', 10000)  (5)

In [23]: csr = dx.constant_short_rate('csr', 0.01)  (6)
         me_dax.add_curve('discount_curve', csr)  (6)


(1) Defines the initial or pricing date given the options data. 

(2) Instantiates the dx.market_environment object. 

(3) Defines the maturity date given the options data. 

(4) Adds the basic model parameters. 

(5) Adds the simulation-related parameters. 

(6) Defines and adds a dx.constant_short_rate object. 


This code then adds the model-specific parameters for the dx.jump_diffusion class 
and instantiates a respective simulation object: 


In [24]: me_dax.add_constant('volatility', 0.2) 
me_dax.add_constant('lambda', 0.8) 
me_dax.add_constant('mu', -0.2) 
me_dax.add_constant('delta', 0.1) 


In [25]: dax_model = dx.jump_diffusion('dax_model', me_dax) 


As an example for a European call option, consider the following parameterization 
for which the strike is set equal to the current index level of the DAX 30. This allows 
for a first value estimation based on Monte Carlo simulation: 


In [26]: me_dax.add_constant('strike', initial_value)  (1)
         me_dax.add_constant('maturity', maturity)

In [27]: payoff_func = 'np.maximum(maturity_value - strike, 0)'  (2)

In [28]: dax_eur_call = dx.valuation_mcs_european('dax_eur_call',
                                 dax_model, me_dax, payoff_func)  (3)

In [29]: dax_eur_call.present_value()  (4)
Out[29]: 654.298085


(1) Sets the value for strike equal to the initial_value. 

(2) Defines the payoff function for a European call option. 

(3) Instantiates the valuation object. 

(4) Initiates the simulation and value estimation. 
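

To see what such a value estimation involves, the following self-contained sketch values an at-the-money European call by Monte Carlo simulation. It is not the DX implementation: the underlying follows a plain geometric Brownian motion instead of the jump diffusion, all parameter values are illustrative, and evaluating the payoff string with eval() is an assumption about how a string-based payoff definition can be processed.

```python
import numpy as np

rng = np.random.default_rng(1000)    # arbitrary seed (assumption)
S0, r, sigma, T, paths = 100.0, 0.01, 0.2, 0.25, 100000   # illustrative values

# simulated index levels at maturity under geometric Brownian motion
z = rng.standard_normal(paths)
maturity_value = S0 * np.exp((r - 0.5 * sigma ** 2) * T
                             + sigma * np.sqrt(T) * z)
strike = S0    # at-the-money, as in the example above

payoff_func = 'np.maximum(maturity_value - strike, 0)'
# evaluate the payoff string with exactly the names it refers to
payoff = eval(payoff_func, {'np': np},
              {'maturity_value': maturity_value, 'strike': strike})

# discounted average payoff as the Monte Carlo estimator
present_value = np.exp(-r * T) * payoff.mean()
print(round(present_value, 3))
```

Defining the payoff as a string keeps the valuation class generic: a different derivative only requires a different expression in terms of maturity_value and strike.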


Similarly, valuation objects can be defined for all relevant European call options on 
the DAX 30 index. The only parameter that changes is the strike price: 


In [30]: option_models = {}  (1)
         for option in option_selection.index:
             strike = option_selection['STRIKE_PRC'].loc[option]  (2)
             me_dax.add_constant('strike', strike)
             option_models[strike] = dx.valuation_mcs_european(
                                         'eur_call_%d' % strike,
                                         dax_model,
                                         me_dax,
                                         payoff_func)


(1) The valuation objects are collected in a dict object. 

(2) Selects the relevant strike price and (re)defines it in the dx.market_environment 
object. 


Now, based on the valuation objects for all relevant options, the function 
calculate_model_values() returns the model values for all options given a set of the 
model-specific parameter values p0: 


In [32]: def calculate_model_values(p0):
             ''' Returns all relevant option values.

             Parameters

             p0: tuple/list
                 tuple of volatility, lamb, mu, delta

             Returns

             model_values: dict
                 dictionary with model values
             '''
             volatility, lamb, mu, delta = p0
             dax_model.update(volatility=volatility, lamb=lamb,
                              mu=mu, delta=delta)
             return {
                 strike: model.present_value(fixed_seed=True)
                 for strike, model in option_models.items()
             }

In [33]: calculate_model_values((0.1, 0.1, -0.4, 0.0))
Out[33]: {12050.0: 611.222524,
          12100.0: 571.83659,
          12150.0: 533.595853,
          12200.0: 496.607225,
          12250.0: 460.863233,
          12300.0: 426.543355,
          12350.0: 393.626483,
          12400.0: 362.066869,
          12450.0: 331.877733,
          12500.0: 303.133596,
          12550.0: 275.987049,
          12600.0: 250.504646,
          12650.0: 226.687523,
          12700.0: 204.550609,
          12750.0: 184.020514,
          12800.0: 164.945082,
          12850.0: 147.249829,
          12900.0: 130.831722,
          12950.0: 115.681449,
          13000.0: 101.917351}


The function calculate_model_values() is used during the calibration procedure, 
as described next. 


Calibration Procedure 


Calibration of an option pricing model is, in general, a convex optimization problem.
The most widely used function for the calibration (i.e., the minimization of some
error function value) is the mean-squared error (MSE) for the model option values
given the market quotes of the options.⁴ Assume there are N relevant options, each
with a model value and a market quote. The problem of calibrating an option pricing
model to the market quotes based on the MSE is then given in Equation 21-1. There,
$C_n^{*}$ and $C_n^{mod}$ are the market price and the model price of the nth option,
respectively. p is the parameter set provided as input to the option pricing model.


Equation 21-1. Mean-squared error for model calibration

$$\min_{p} \frac{1}{N} \sum_{n=1}^{N} \left( C_n^{*} - C_n^{mod}(p) \right)^2$$
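As a small numerical illustration of the MSE in Equation 21-1, the following sketch evaluates the error for three options only. The values are the calibrated model values and the market quotes for the strikes 12800, 12850, and 12900 reported in this section; the helper function mse() is a hypothetical name used for illustration and is not part of the DX package.

```python
# model values and market quotes for the strikes 12800, 12850, 12900
model_values = [182.947468, 163.907583, 146.259349]
market_quotes = [182.4, 162.0, 142.9]


def mse(model, market):
    """Mean-squared error between model values and market quotes."""
    if len(model) != len(market):
        raise ValueError('inputs must have the same length')
    diffs = [m - q for m, q in zip(model, market)]  # element-wise differences
    return sum(d ** 2 for d in diffs) / len(diffs)  # average squared difference


error = mse(model_values, market_quotes)
print(round(error, 4))
```

The same arithmetic is what the NumPy-based implementation in the text performs over all 20 options at once.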


The Python function mean_squared_error() implements this approach to model cal- 
ibration technically. A global variable i is used to control the output of intermediate 
parameter tuple objects and the resulting MSE: 


In [34]: i = 0
         def mean_squared_error(p0):
             ''' Returns the mean-squared error given
             the model and market values.

             Parameters
             ===========
             p0: tuple/list
                 tuple of volatility, lamb, mu, delta

             Returns
             =======
             MSE: float
                 mean-squared error
             '''
             global i
             model_values = np.array(list(
                     calculate_model_values(p0).values()))  # (1)
             market_values = option_selection['CF_CLOSE'].values  # (2)
             option_diffs = model_values - market_values  # (3)
             MSE = np.sum(option_diffs ** 2) / len(option_diffs)  # (4)
             if i % 75 == 0:
                 if i == 0:
                     print('%4s %6s %6s %6s %6s --> %6s' %
                          ('i', 'vola', 'lambda', 'mu', 'delta', 'MSE'))
                 print('%4d %6.3f %6.3f %6.3f %6.3f --> %6.3f' %
                         (i, p0[0], p0[1], p0[2], p0[3], MSE))
             i += 1
             return MSE

In [35]: mean_squared_error((0.1, 0.1, -0.4, 0.0))  # (5)
            i   vola lambda     mu  delta -->    MSE
            0  0.100  0.100 -0.400  0.000 --> 728.375

Out[35]: 728.3752973715275

(1) Estimates the set of model values.

(2) Picks out the market quotes.

(3) Calculates element-wise the differences between the two.

(4) Calculates the mean-squared error value.

(5) Illustrates such a calculation based on sample parameters.

4 There are multiple alternatives to define the target function for the calibration
procedure. See Hilpisch (2015), Chapter 11, for a discussion of this topic.



Chapter 11 introduces the two functions (spo.brute() and spo.fmin()) that are
used to implement the calibration procedure. The first step is a global minimization
based on ranges for the four model-specific parameter values. The result is an optimal
parameter combination given all the parameter combinations checked during the brute
force minimization:


In [36]: import scipy.optimize as spo 


In [37]: %%time
         i = 0
         opt_global = spo.brute(mean_squared_error,
             ((0.10, 0.201, 0.025),  # range for volatility
              (0.10, 0.80, 0.10),  # range for jump intensity
              (-0.40, 0.01, 0.10),  # range for average jump size
              (0.00, 0.121, 0.02)),  # range for jump variability
             finish=None)
            i   vola lambda     mu  delta -->    MSE
            0  0.100  0.100 -0.400  0.000 --> 728.375
           75  0.100  0.300 -0.400  0.080 --> 5157.513
          150  0.100  0.500 -0.300  0.040 --> 12199.386
          225  0.100  0.700 -0.200  0.000 --> 6904.932
          300  0.125  0.200 -0.200  0.100 --> 855.412
          375  0.125  0.400 -0.100  0.060 --> 621.800
          450  0.125  0.600  0.000  0.020 --> 544.137
          525  0.150  0.100  0.000  0.120 --> 3410.776
          600  0.150  0.400 -0.400  0.080 --> 46775.769
          675  0.150  0.600 -0.300  0.040 --> 56331.321
          750  0.175  0.100 -0.200  0.000 --> 14562.213
          825  0.175  0.300 -0.200  0.100 --> 24599.738
          900  0.175  0.500 -0.100  0.060 --> 19183.167
          975  0.175  0.700  0.000  0.020 --> 11871.683
         1050  0.200  0.200  0.000  0.120 --> 31736.403
         1125  0.200  0.500 -0.400  0.080 --> 130372.718
         1200  0.200  0.700 -0.300  0.040 --> 126365.140
         CPU times: user 1min 45s, sys: 7.07 s, total: 1min 52s
         Wall time: 1min 56s


In [38]: mean_squared_error(opt_global) 
Out[38]: 17.946670038040985 


The opt_global values are intermediate results only. They are used as starting values 
for the local minimization. Given the parameterization used, the opt_local values 
are final and optimal given certain assumed tolerance levels: 


In [39]: %%time
         i = 0
         opt_local = spo.fmin(mean_squared_error, opt_global,
                              xtol=0.00001, ftol=0.00001,
                              maxiter=200, maxfun=550)
            i   vola lambda     mu  delta -->    MSE
            0  0.100  0.200 -0.300  0.000 --> 17.947
           75  0.098  0.216 -0.302 -0.001 -->  7.885
          150  0.098  0.216 -0.300 -0.001 -->  7.371
         Optimization terminated successfully.
                  Current function value: 7.371163
                  Iterations: 100
                  Function evaluations: 188
         CPU times: user 15.6 s, sys: 1.03 s, total: 16.6 s
         Wall time: 16.7 s


In [40]: i = 0
         mean_squared_error(opt_local)  # (1)
            i   vola lambda     mu  delta -->    MSE
            0  0.098  0.216 -0.300 -0.001 -->  7.371

Out[40]: 7.371162645265256

In [41]: calculate_model_values(opt_local)  # (2)
Out[41]: {12050.0: 647.428189,
          12100.0: 607.402796,
          12150.0: 568.46137,
          12200.0: 530.703659,
          12250.0: 494.093839,
          12300.0: 458.718401,
          12350.0: 424.650128,
          12400.0: 392.023241,
          12450.0: 360.728543,
          12500.0: 330.727256,
          12550.0: 302.117223,
          12600.0: 274.98474,
          12650.0: 249.501807,
          12700.0: 225.678695,
          12750.0: 203.490065,
          12800.0: 182.947468,
          12850.0: 163.907583,
          12900.0: 146.259349,
          12950.0: 129.909743,
          13000.0: 114.852425}

(1) The mean-squared error given the optimal parameter values.

(2) The model values given the optimal parameter values.
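The two-stage logic of the calibration (a coarse global scan followed by a local refinement that starts at the best grid point) can be sketched on a toy problem. The following is a minimal pure-Python illustration, not the SciPy implementation: objective() is a made-up error function with a known minimum, grid_search() plays the role of spo.brute() (unlike spo.brute(), the grid here includes the end point of each range), and a simple coordinate search stands in for the Nelder-Mead simplex algorithm behind spo.fmin(). All function names are hypothetical.

```python
import itertools


def objective(p):
    """Toy error function with its minimum at (0.1, -0.3); NOT an option model."""
    x, y = p
    return (x - 0.1) ** 2 + (y + 0.3) ** 2


def grid_search(func, ranges):
    """Stage 1: evaluates func on a coarse grid and returns the best grid point."""
    axes = []
    for start, stop, step in ranges:
        n = int(round((stop - start) / step)) + 1  # end point included
        axes.append([start + k * step for k in range(n)])
    return min(itertools.product(*axes), key=func)


def local_refine(func, start, step=0.05, tol=1e-6):
    """Stage 2: coordinate search that halves the step size when stuck."""
    best = list(start)
    while step > tol:
        improved = False
        for k in range(len(best)):
            for delta in (step, -step):
                trial = list(best)
                trial[k] += delta
                if func(trial) < func(best):
                    best, improved = trial, True
        if not improved:
            step /= 2
    return best


coarse = grid_search(objective, ((0.0, 0.5, 0.25), (-0.5, 0.0, 0.25)))
fine = local_refine(objective, coarse)
print(coarse, [round(v, 4) for v in fine])
```

The coarse result is only as good as the grid spacing allows; the local step then closes the remaining gap, which mirrors the drop in MSE between opt_global and opt_local in the text.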


Next, we compare the model values for the optimal parameters with the market
quotes. The pricing errors are calculated as the differences in EUR between the
model values and market quotes and as the deviation in percent from the market
quotes:


In [42]: option_selection['MODEL'] = np.array(list(calculate_model_values(
                                                  opt_local).values()))
         option_selection['ERRORS_EUR'] = (option_selection['MODEL'] -
                                           option_selection['CF_CLOSE'])
         option_selection['ERRORS_%'] = (option_selection['ERRORS_EUR'] /
                                         option_selection['CF_CLOSE']) * 100

In [43]: option_selection[['MODEL', 'CF_CLOSE', 'ERRORS_EUR', 'ERRORS_%']]
Out[43]:          MODEL  CF_CLOSE  ERRORS_EUR  ERRORS_%
         43  647.428189     642.6    4.828189  0.751352
         45  607.402796     604.4    3.002796  0.496823
         47  568.461370     567.1    1.361370  0.240058
         49  530.703659     530.4    0.303659  0.057251
         51  494.093839     494.8   -0.706161 -0.142716
         53  458.718401     460.3   -1.581599 -0.343602
         55  424.650128     426.8   -2.149872 -0.503719
         57  392.023241     394.4   -2.376759 -0.602627
         59  360.728543     363.3   -2.571457 -0.707805
         61  330.727256     333.3   -2.572744 -0.771900
         63  302.117223     304.8   -2.682777 -0.880176
         65  274.984740     277.5   -2.515260 -0.906400
         67  249.501807     251.7   -2.198193 -0.873338
         69  225.678695     227.3   -1.621305 -0.713289
         71  203.490065     204.1   -0.609935 -0.298841
         73  182.947468     182.4    0.547468  0.300147
         75  163.907583     162.0    1.907583  1.177520
         77  146.259349     142.9    3.359349  2.350839
         79  129.909743     125.4    4.509743  3.596286
         81  114.852425     109.4    5.452425  4.983935


In [44]: round(option_selection['ERRORS_EUR'].mean(), 3)  # (1)
Out[44]: 0.184

In [45]: round(option_selection['ERRORS_%'].mean(), 3)  # (2)
Out[45]: 0.36

(1) The average pricing error in EUR.

(2) The average pricing error in percent.

Figure 21-4 visualizes the valuation results and errors:


In [46]: fig, (ax1, ax2, ax3) = plt.subplots(3, sharex=True, figsize=(10, 10))
         strikes = option_selection['STRIKE_PRC'].values
         ax1.plot(strikes, option_selection['CF_CLOSE'], label='market quotes')
         ax1.plot(strikes, option_selection['MODEL'], 'ro', label='model values')
         ax1.set_ylabel('option values')
         ax1.legend(loc=0)
         wi = 15
         ax2.bar(strikes - wi / 2., option_selection['ERRORS_EUR'], width=wi)
         ax2.set_ylabel('errors [EUR]')
         ax3.bar(strikes - wi / 2., option_selection['ERRORS_%'], width=wi)
         ax3.set_ylabel('errors [%]')
         ax3.set_xlabel('strikes');


Calibration Speed


The calibration of an option pricing model to market data in general requires the
recalculation of hundreds or even thousands of option values. This is therefore
typically done based on analytical pricing formulae. Here, the calibration procedure
relies on Monte Carlo simulation as the pricing method, which is computationally
more demanding compared to analytical methods. Nevertheless, the calibration
procedure does not take "too long" even on a typical notebook. The use of
parallelization techniques, for instance, can speed up the calibration considerably.
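To make the difference in computational burden concrete, the following sketch values a single European call option twice: once with the analytical Black-Scholes-Merton formula and once with a basic Monte Carlo estimator. It is a simplified illustration under stated assumptions: the model is Black-Scholes-Merton rather than the jump diffusion calibrated in this chapter, all parameter values are hypothetical, and the function names bsm_call() and mcs_call() are not part of the DX package.

```python
import math
import random


def bsm_call(S0, K, T, r, sigma):
    """Analytical Black-Scholes-Merton value of a European call option."""
    d1 = (math.log(S0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    N = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))  # standard normal CDF
    return S0 * N(d1) - K * math.exp(-r * T) * N(d2)


def mcs_call(S0, K, T, r, sigma, paths=50000, seed=1000):
    """Monte Carlo estimator for the same option (one step to maturity)."""
    rng = random.Random(seed)  # fixed seed for reproducible estimates
    drift = (r - 0.5 * sigma ** 2) * T
    vol = sigma * math.sqrt(T)
    payoff_sum = 0.0
    for _ in range(paths):
        ST = S0 * math.exp(drift + vol * rng.gauss(0.0, 1.0))  # simulated end value
        payoff_sum += max(ST - K, 0.0)  # call payoff at maturity
    return math.exp(-r * T) * payoff_sum / paths  # discounted average payoff


params = dict(S0=100.0, K=100.0, T=1.0, r=0.05, sigma=0.2)
analytical = bsm_call(**params)
simulated = mcs_call(**params)
print(round(analytical, 4), round(simulated, 4))
```

The formula needs a handful of arithmetic operations per option value, while the simulation needs one random draw and one payoff evaluation per path. A calibration repeats this for every option and every parameter combination checked.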



Figure 21-4. Model values and market quotes after calibration 


Portfolio Valuation 


Being equipped with a calibrated model reflecting realities in the financial markets as 
represented by market quotes of liquidly traded options enables one to model and 
value non-traded options and derivatives. The idea is that calibration “infuses” the 
correct risk-neutral martingale measure into the model via optimal parameters. 
Based on this measure, the machinery of the Fundamental Theorem of Asset Pricing 
can then be applied to contingent claims beyond those used for the calibration. 


This section considers a portfolio of American put options on the DAX 30 index.
There are no such options available that are liquidly traded on exchanges. For
simplicity, it is assumed that the American put options have the same maturity as the
European call options used for the calibration. Similarly, the same strikes are
assumed.


Modeling Option Positions 


First, the market environment for the underlying risk factor, the DAX 30 stock index, 
is modeled with the optimal parameters from the calibration being used: 


In [47]: me_dax = dx.market_environment('me_dax', pricing_date)
         me_dax.add_constant('initial_value', initial_value)
         me_dax.add_constant('final_date', pricing_date)
         me_dax.add_constant('currency', 'EUR')

In [48]: me_dax.add_constant('volatility', opt_local[0])  # (1)
         me_dax.add_constant('lambda', opt_local[1])  # (1)
         me_dax.add_constant('mu', opt_local[2])  # (1)
         me_dax.add_constant('delta', opt_local[3])  # (1)

In [49]: me_dax.add_constant('model', 'jd')

(1) This adds the optimal parameters from the calibration.


Second, the option positions and the associated environments are defined and stored 
in two separate dict objects: 


In [50]: payoff_func = 'np.maximum(strike - instrument_values, 0)'

In [51]: shared = dx.market_environment('share', pricing_date)  # (1)
         shared.add_constant('maturity', maturity)  # (1)
         shared.add_constant('currency', 'EUR')  # (1)

In [52]: option_positions = {}
         option_environments = {}
         for option in option_selection.index:
             option_environments[option] = dx.market_environment(
                 'am_put_%d' % option, pricing_date)  # (2)
             strike = option_selection['STRIKE_PRC'].loc[option]  # (3)
             option_environments[option].add_constant('strike', strike)  # (3)
             option_environments[option].add_environment(shared)  # (4)
             option_positions['am_put_%d' % strike] = \
                 dx.derivatives_position(
                     'am_put_%d' % strike,
                     quantity=np.random.randint(10, 50),
                     underlying='dax_model',
                     mar_env=option_environments[option],
                     otype='American',
                     payoff_func=payoff_func)  # (5)

(1) Defines a shared dx.market_environment object as the basis for all
option-specific environments.

(2) Defines and stores a new dx.market_environment object for the relevant
American put option.

(3) Defines and stores the strike price parameter for the option.

(4) Adds the elements from the shared dx.market_environment object to the
option-specific one.

(5) Defines the dx.derivatives_position object with a randomized quantity.


The Options Portfolio 


To value the portfolio with all the American put options, a valuation environment is 
needed. It contains the major parameters for the estimation of position values and 
risk statistics: 


In [53]: val_env = dx.market_environment('val_env', pricing_date)
         val_env.add_constant('starting_date', pricing_date)
         val_env.add_constant('final_date', pricing_date)  # (1)
         val_env.add_curve('discount_curve', csr)
         val_env.add_constant('frequency', 'B')
         val_env.add_constant('paths', 25000)

In [54]: underlyings = {'dax_model' : me_dax}  # (2)

In [55]: portfolio = dx.derivatives_portfolio('portfolio', option_positions,
                                              val_env, underlyings)  # (3)

In [56]: %time results = portfolio.get_statistics(fixed_seed=True)
         CPU times: user 1min 5s, sys: 2.91 s, total: 1min 8s
         Wall time: 38.2 s

In [57]: results.round(1)
Out[57]:             name  quant.  value curr.    pos_value  pos_delta  pos_vega
         0   am_put_12050      33  151.6   EUR       5002.8       -4.7   38206.9
         1   am_put_12100      38  161.5   EUR       6138.4       -5.7   51365.2
         2   am_put_12150      20  171.3   EUR       3426.8       -3.3   27894.5
         3   am_put_12200      12  183.9   EUR       2206.6       -2.2   18479.7
         4   am_put_12250      37  197.4   EUR       7302.8       -7.3   59423.5
         5   am_put_12300      37  212.3   EUR       7853.9       -8.2   65911.9
         6   am_put_12350      36  228.4   EUR       8224.1       -9.0   70969.4
         7   am_put_12400      16  244.3   EUR       3908.4       -4.3   32871.4
         8   am_put_12450      17  262.7   EUR       4465.6       -5.1   37451.2
         9   am_put_12500      16  283.4   EUR       4534.8       -5.2   36158.2
         10  am_put_12550      38  305.3   EUR      11602.3      -13.3   86869.9
         11  am_put_12600      10  330.4   EUR       3303.9       -3.9   22144.5
         12  am_put_12650      38  355.5   EUR      13508.3      -16.0   89124.8
         13  am_put_12700      40  384.2   EUR      15367.5      -18.6   90871.2
         14  am_put_12750      13  413.5   EUR  [illegible]       -6.5   28626.0
         15  am_put_12800      49  445.0   EUR      21806.6      -26.3  105287.3
         16  am_put_12850      30  477.4   EUR      14321.8      -17.0   60757.2
         17  am_put_12900      33  510.3   EUR      16840.1      -19.7   69163.6
         18  am_put_12950      40  544.4   EUR      21777.0      -24.9   80472.3
         19  am_put_13000      35  582.3   EUR      20378.9      -22.9   66522.6

In [58]: results[['pos_value','pos_delta','pos_vega']].sum().round(1)
Out[58]: pos_value     197346.2
         pos_delta       -224.0
         pos_vega      1138571.1
         dtype: float64

(1) The final_date parameter is later reset to the final maturity date over all
options in the portfolio.

(2) The American put options in the portfolio are all written on the same underlying
risk factor, the DAX 30 stock index.

(3) This instantiates the dx.derivatives_portfolio object.


The estimation of all statistics takes a little while, since it is all based on Monte Carlo
simulation and such estimations are particularly compute-intensive for American
options due to the application of the Least-Squares Monte Carlo (LSM) algorithm.
Because we are dealing with long positions of American put options only, the
portfolio is short delta and long vega.
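The sign pattern (short delta, long vega) can be checked with a simplified analytical stand-in. The sketch below uses the Black-Scholes-Merton delta and vega of a European put, not the LSM-based estimates for American puts that the portfolio object produces. The quantities and strikes are those of the first and last positions in the results table; the values for S0, T, r, and sigma are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))


def norm_pdf(x):
    """Standard normal probability density function."""
    return math.exp(-0.5 * x ** 2) / math.sqrt(2 * math.pi)


def bsm_put_greeks(S0, K, T, r, sigma):
    """Delta and vega of a European put in the Black-Scholes-Merton model."""
    d1 = (math.log(S0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    delta = norm_cdf(d1) - 1  # always between -1 and 0
    vega = S0 * norm_pdf(d1) * math.sqrt(T)  # always positive
    return delta, vega


# long positions as (quantity, strike); market parameters below are assumed
positions = [(33, 12050.0), (35, 13000.0)]
pos_delta = pos_vega = 0.0
for quantity, strike in positions:
    delta, vega = bsm_put_greeks(S0=12500.0, K=strike, T=0.25, r=0.01, sigma=0.15)
    pos_delta += quantity * delta  # position delta = quantity x option delta
    pos_vega += quantity * vega  # position vega = quantity x option vega
print(round(pos_delta, 1), round(pos_vega, 1))
```

Since every put has a negative delta and a positive vega, any portfolio with only positive quantities inherits both signs, whatever the individual strikes are.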


Python Code 


The following presents code to retrieve options data for the German DAX 30 stock 
index from the Eikon Data API: 


In [1]: import eikon as ek  # (1)
        import pandas as pd
        import datetime as dt
        import configparser as cp

In [2]: cfg = cp.ConfigParser()  # (2)
        cfg.read('eikon.cfg')
Out[2]: ['eikon.cfg']

In [3]: ek.set_app_id(cfg['eikon']['app_id'])  # (2)

In [4]: fields = ['CF_DATE', 'EXPIR_DATE', 'PUTCALLIND',
                  'STRIKE_PRC', 'CF_CLOSE', 'IMP_VOLT']  # (3)

In [5]: dax = ek.get_data('0#GDAXN8*.EX', fields=fields)[0]  # (4)

In [6]: dax.info()  # (4)
        <class 'pandas.core.frame.DataFrame'>
        RangeIndex: 115 entries, 0 to 114
        Data columns (total 7 columns):
        Instrument    115 non-null object
        CF_DATE       115 non-null object
        EXPIR_DATE    114 non-null object
        PUTCALLIND    114 non-null object
        STRIKE_PRC    114 non-null float64
        CF_CLOSE      115 non-null float64
        IMP_VOLT      114 non-null float64
        dtypes: float64(3), object(4)
        memory usage: 6.4+ KB

In [7]: dax['Instrument'] = dax['Instrument'].apply(
            lambda x: x.replace('/', ''))  # (5)

In [8]: dax.set_index('Instrument').head(10)
Out[8]:                     CF_DATE  EXPIR_DATE PUTCALLIND  STRIKE_PRC  CF_CLOSE  \
        Instrument
        .GDAXI           2018-04-27        None       None         NaN  12500.47
        GDAX105000G8.EX  2018-04-27  2018-07-20       CALL     10500.0   2040.80
        GDAX105000S8.EX  2018-04-27  2018-07-20        PUT     10500.0     32.00
        GDAX108000G8.EX  2018-04-27  2018-07-20       CALL     10800.0   1752.40
        GDAX108000S8.EX  2018-04-26  2018-07-20        PUT     10800.0     43.80
        GDAX110000G8.EX  2018-04-27  2018-07-20       CALL     11000.0   1562.80
        GDAX110000S8.EX  2018-04-27  2018-07-20        PUT     11000.0     54.50
        GDAX111500G8.EX  2018-04-27  2018-07-20       CALL     11150.0   1422.50
        GDAX111500S8.EX  2018-04-27  2018-07-20        PUT     11150.0     64.30
        GDAX112000G8.EX  2018-04-27  2018-07-20       CALL     11200.0   1376.10

                         IMP_VOLT
        Instrument
        .GDAXI                NaN
        GDAX105000G8.EX     23.59
        GDAX105000S8.EX     23.59
        GDAX108000G8.EX     22.02
        GDAX108000S8.EX     22.02
        GDAX110000G8.EX     21.00
        GDAX110000S8.EX     21.00
        GDAX111500G8.EX     20.24
        GDAX111500S8.EX     20.25
        GDAX112000G8.EX     19.99

In [9]: dax.to_csv('../../source/tr_eikon_option_data.csv')  # (6)

(1) Imports the eikon Python wrapper package.

(2) Reads the login credentials for the Eikon Data API.

(3) Defines the data fields to be retrieved.

(4) Retrieves options data for the July 2018 expiry.

(5) Replaces the slash character / in the instrument names.

(6) Writes the data set as a CSV file.


Conclusion 


This chapter presents a larger, realistic use case for the application of the DX analytics
package to the valuation of a portfolio of non-traded American options on the German
DAX 30 stock index. The chapter addresses three main tasks typically involved
in any real-world derivatives analytics application:


Obtaining data
Current, correct market data forms the basis of any modeling and valuation
effort in derivatives analytics; one needs index data as well as options data for the
DAX 30.


Model calibration
To value, manage, and hedge non-traded options and derivatives in a
market-consistent fashion, one has to calibrate the parameters of an appropriate model
(simulation object) to the relevant option market quotes (relevant with regard to
maturity and strikes). The model of choice is the jump diffusion model, which is
in some cases appropriate for modeling a stock index; the calibration results are
quite good although the model only offers three degrees of freedom (lambda as
the jump intensity, mu as the expected jump size, and delta as the variability of
the jump size).


Portfolio valuation
Based on the market data and the calibrated model, a portfolio with the American
put options on the DAX 30 index was modeled and major statistics (position
values, deltas, and vegas) were estimated.


The realistic use case in this chapter shows the flexibility and the power of the DX
package; it essentially allows one to address the major analytical tasks with regard to
derivatives. The very approach and architecture make the application largely
comparable to the benchmark case of a Black-Scholes-Merton analytical formula for
European options. Once the valuation objects are defined, one can use them in a similar
way as an analytical formula, despite the fact that under the hood, computationally
demanding and memory-intensive algorithms are applied.
Further Resources


As for previous chapters, the following book is a good general reference for the topics
covered in this chapter, especially when it comes to the calibration of option pricing
models:


• Hilpisch, Yves (2015). Derivatives Analytics with Python. Chichester, England:
  Wiley Finance.


With regard to the consistent valuation and management of derivatives portfolios,
see also the resources at the end of Chapter 20.
APPENDIX A
Dates and Times


As in the majority of scientific disciplines, dates and times play an important role in 
finance. This appendix introduces different aspects of this topic when it comes to 
Python programming. It cannot, of course, be exhaustive. However, it provides an 
introduction to the main areas of the Python ecosystem that support the modeling of 
date and time information. 


Python 


The datetime module from the Python standard library allows for the implementation
of the most important date and time-related tasks:

In [1]: from pylab import mpl, plt
        plt.style.use('seaborn')
        mpl.rcParams['font.family'] = 'serif'
        %matplotlib inline

In [2]: import datetime as dt

In [3]: dt.datetime.now()  # (1)
Out[3]: datetime.datetime(2018, 10, 19, 15, 17, 32, 164295)

In [4]: to = dt.datetime.today()  # (1)
        to
Out[4]: datetime.datetime(2018, 10, 19, 15, 17, 32, 177092)

In [5]: type(to)
Out[5]: datetime.datetime

In [6]: dt.datetime.today().weekday()  # (2)
Out[6]: 4

(1) Returns the exact date and system time.

(2) Returns the day of the week as a number, where 0 = Monday.

Of course, datetime objects can be defined freely:


In [7]: d = dt.datetime(2020, 10, 31, 10, 5, 30, 500000)  # (1)
        d
Out[7]: datetime.datetime(2020, 10, 31, 10, 5, 30, 500000)

In [8]: str(d)  # (2)
Out[8]: '2020-10-31 10:05:30.500000'

In [9]: print(d)  # (3)
        2020-10-31 10:05:30.500000

In [10]: d.year  # (4)
Out[10]: 2020

In [11]: d.month  # (5)
Out[11]: 10

In [12]: d.day  # (6)
Out[12]: 31

In [13]: d.hour  # (7)
Out[13]: 10

(1) Custom datetime object.

(2) String representation.

(3) Printing such an object.

(4) The year ...

(5) ... month ...

(6) ... day ...

(7) ... and hour attributes of the object.

Transformations and split-ups are easily accomplished:

In [14]: o = d.toordinal()  # (1)
         o
Out[14]: 737729

In [15]: dt.datetime.fromordinal(o)  # (2)
Out[15]: datetime.datetime(2020, 10, 31, 0, 0)

In [16]: t = dt.datetime.time(d)  # (3)
         t
Out[16]: datetime.time(10, 5, 30, 500000)

In [17]: type(t)
Out[17]: datetime.time

In [18]: dd = dt.datetime.date(d)  # (4)
         dd
Out[18]: datetime.date(2020, 10, 31)

In [19]: d.replace(second=0, microsecond=0)  # (5)
Out[19]: datetime.datetime(2020, 10, 31, 10, 5)

(1) Transformation to ordinal number.

(2) Transformation from ordinal number.

(3) Splitting up the time component.

(4) Splitting up the date component.

(5) Setting selected values to 0.


timedelta objects result from, among other things, arithmetic operations on
datetime objects (i.e., finding the difference between two such objects):

In [20]: td = d - dt.datetime.now()  # (1)
         td
Out[20]: datetime.timedelta(days=742, seconds=67678, microseconds=169720)

In [21]: type(td)  # (2)
Out[21]: datetime.timedelta

In [22]: td.days
Out[22]: 742

In [23]: td.seconds
Out[23]: 67678

In [24]: td.microseconds
Out[24]: 169720

In [25]: td.total_seconds()  # (3)
Out[25]: 64176478.16972

(1) The difference between two datetime objects ...

(2) ... gives a timedelta object.

(3) The difference in seconds.
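A typical use of timedelta objects in finance is the calculation of the time-to-maturity of an option. The following sketch uses the quote date (2018-04-27) and the expiry date (2018-07-20) of the DAX options from Chapter 21; the actual/365 day-count convention is an assumption made for this illustration.

```python
import datetime as dt

pricing_date = dt.datetime(2018, 4, 27)  # CF_DATE of the option quotes
maturity = dt.datetime(2018, 7, 20)  # EXPIR_DATE of the July 2018 options

td = maturity - pricing_date  # timedelta object
days = td.days  # time-to-maturity in calendar days
# time-to-maturity in year fractions (assumes actual/365)
year_fraction = td.total_seconds() / (365 * 24 * 60 * 60)
print(days, round(year_fraction, 4))
```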
There are multiple ways to transform a datetime object into different representations,
as well as to generate datetime objects out of, say, str objects. Details are
found in the documentation of the datetime module. Here are a few examples:

In [26]: d.isoformat()  # (1)
Out[26]: '2020-10-31T10:05:30.500000'

In [27]: d.strftime('%A, %d. %B %Y %I:%M%p')  # (2)
Out[27]: 'Saturday, 31. October 2020 10:05AM'

In [28]: dt.datetime.strptime('2017-03-31', '%Y-%m-%d')  # (3)
Out[28]: datetime.datetime(2017, 3, 31, 0, 0)

In [29]: dt.datetime.strptime('30-4-16', '%d-%m-%y')  # (3)
Out[29]: datetime.datetime(2016, 4, 30, 0, 0)

In [30]: ds = str(d)
         ds
Out[30]: '2020-10-31 10:05:30.500000'

In [31]: dt.datetime.strptime(ds, '%Y-%m-%d %H:%M:%S.%f')  # (3)
Out[31]: datetime.datetime(2020, 10, 31, 10, 5, 30, 500000)

(1) ISO format string representation.

(2) Exact template for string representation.

(3) datetime object from str object based on template.
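The template passed to strptime() has to match the input string exactly; otherwise a ValueError is raised. The following sketch shows a round trip between str and datetime objects and what happens with a non-matching template. The date strings are illustrative.

```python
import datetime as dt

raw = ['2018-07-20', '2018-04-27']  # date strings in ISO style
fmt = '%Y-%m-%d'  # template for both directions

parsed = [dt.datetime.strptime(s, fmt) for s in raw]  # str -> datetime
restored = [d.strftime(fmt) for d in parsed]  # datetime -> str

try:
    dt.datetime.strptime('20.07.2018', fmt)  # template does not match
except ValueError as err:
    message = str(err)  # explains which string/template pair failed

print(parsed[0], restored == raw)
print(message)
```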


In addition to the now() and today() functions, there is also the utcnow() function,
which gives the exact date and time information in UTC (Coordinated Universal
Time, formerly known as Greenwich Mean Time, or GMT). This represents a one-hour
or two-hour difference from the author's time zone (Central European Time,
CET, or Central European Summer Time, CEST):

In [32]: dt.datetime.now()
Out[32]: datetime.datetime(2018, 10, 19, 15, 17, 32, 438889)

In [33]: dt.datetime.utcnow()  # (1)
Out[33]: datetime.datetime(2018, 10, 19, 13, 17, 32, 448897)

In [34]: dt.datetime.now() - dt.datetime.utcnow()  # (2)
Out[34]: datetime.timedelta(seconds=7199, microseconds=999995)

(1) Returns the current UTC time.

(2) Returns the difference between local time and UTC time.
Another class of the datetime module is the tzinfo class, a generic time zone class
with methods utcoffset(), dst(), and tzname(). A definition for UTC and CEST
time might look as follows:

In [35]: class UTC(dt.tzinfo):
             def utcoffset(self, d):
                 return dt.timedelta(hours=0)  # (1)
             def dst(self, d):
                 return dt.timedelta(hours=0)  # (1)
             def tzname(self, d):
                 return 'UTC'

In [36]: u = dt.datetime.utcnow()

In [37]: u
Out[37]: datetime.datetime(2018, 10, 19, 13, 17, 32, 474585)

In [38]: u = u.replace(tzinfo=UTC())  # (2)

In [39]: u
Out[39]: datetime.datetime(2018, 10, 19, 13, 17, 32, 474585, tzinfo=<__main__.UTC
          object at 0x11c9a2320>)

In [40]: class CEST(dt.tzinfo):
             def utcoffset(self, d):
                 return dt.timedelta(hours=2)  # (3)
             def dst(self, d):
                 return dt.timedelta(hours=1)  # (3)
             def tzname(self, d):
                 return 'CEST'

In [41]: c = u.astimezone(CEST())  # (4)
         c
Out[41]: datetime.datetime(2018, 10, 19, 15, 17, 32, 474585,
          tzinfo=<__main__.CEST object at 0x11c9a2cc0>)

In [42]: c - c.dst()  # (5)
Out[42]: datetime.datetime(2018, 10, 19, 14, 17, 32, 474585,
          tzinfo=<__main__.CEST object at 0x11c9a2cc0>)

(1) No offsets for UTC.

(2) Attaches the dt.tzinfo object via the replace() method.

(3) Regular and DST (Daylight Saving Time) offsets for CEST.

(4) Transforms the UTC time zone to the CEST time zone.

(5) Gives the DST time for the transformed datetime object.
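For fixed offsets, the standard library also ships the datetime.timezone class, so a custom tzinfo subclass is not strictly required. The sketch below is an alternative to the classes defined above, not a replacement for them: it reproduces the UTC to CEST transformation with a fixed offset of two hours. Note that such a fixed-offset object carries no DST rules.

```python
import datetime as dt

cest = dt.timezone(dt.timedelta(hours=2), 'CEST')  # fixed UTC+2 offset

# time-zone-aware UTC datetime object (illustrative timestamp)
u = dt.datetime(2018, 10, 19, 13, 17, 32, tzinfo=dt.timezone.utc)
c = u.astimezone(cest)  # same instant, expressed in CEST

print(c.isoformat())
print(c.utcoffset(), c.tzname(), c == u)
```

Aware datetime objects compare by instant, which is why c and u are equal although their wall-clock times differ by two hours.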
There is a Python module available called pytz that implements the most important
time zones from around the world:

In [43]: import pytz

In [44]: pytz.country_names['US']  # (1)
Out[44]: 'United States'

In [45]: pytz.country_timezones['BE']  # (2)
Out[45]: ['Europe/Brussels']

In [46]: pytz.common_timezones[-10:]  # (3)
Out[46]: ['Pacific/Wake',
          'Pacific/Wallis',
          'US/Alaska',
          'US/Arizona',
          'US/Central',
          'US/Eastern',
          'US/Hawaii',
          'US/Mountain',
          'US/Pacific',
          'UTC']

(1) A single country.

(2) A single time zone.

(3) Some common time zones.

With pytz, there is generally no need to define custom tzinfo objects:

In [47]: u = dt.datetime.utcnow()

In [48]: u = u.replace(tzinfo=pytz.utc)  # (1)

In [49]: u
Out[49]: datetime.datetime(2018, 10, 19, 13, 17, 32, 611417, tzinfo=<UTC>)

In [50]: u.astimezone(pytz.timezone('CET'))  # (2)
Out[50]: datetime.datetime(2018, 10, 19, 15, 17, 32, 611417, tzinfo=<DstTzInfo
          'CET' CEST+2:00:00 DST>)

In [51]: u.astimezone(pytz.timezone('GMT'))  # (2)
Out[51]: datetime.datetime(2018, 10, 19, 13, 17, 32, 611417, tzinfo=<StaticTzInfo
          'GMT'>)

In [52]: u.astimezone(pytz.timezone('US/Central'))  # (2)
Out[52]: datetime.datetime(2018, 10, 19, 8, 17, 32, 611417, tzinfo=<DstTzInfo
          'US/Central' CDT-1 day, 19:00:00 DST>)

(1) Defining the tzinfo object via pytz.

(2) Transforming a datetime object to different time zones.


NumPy 


NumPy also provides functionality to deal with date and time information: 


In [53]: import numpy as np

In [54]: nd = np.datetime64('2020-10-31')  # (1)
         nd
Out[54]: numpy.datetime64('2020-10-31')

In [55]: np.datetime_as_string(nd)  # (1)
Out[55]: '2020-10-31'

In [56]: np.datetime_data(nd)  # (2)
Out[56]: ('D', 1)

In [57]: d
Out[57]: datetime.datetime(2020, 10, 31, 10, 5, 30, 500000)

In [58]: nd = np.datetime64(d)  # (3)
         nd
Out[58]: numpy.datetime64('2020-10-31T10:05:30.500000')

In [59]: nd.astype(dt.datetime)  # (4)
Out[59]: datetime.datetime(2020, 10, 31, 10, 5, 30, 500000)

(1) Construction from str object and string representation.

(2) Metainformation about the data itself (type, size).

(3) Construction from datetime object.

(4) Conversion to datetime object.


Another way to construct such an object is by providing a str object, e.g., with the 
year and month and the frequency information. Note that the object value then 
defaults to the first day of the month. The construction of ndarray objects based on 
list objects also is possible: 

In [60]: nd = np.datetime64('2020-10', 'D')
         nd
Out[60]: numpy.datetime64('2020-10-01')

In [61]: np.datetime64('2020-10') == np.datetime64('2020-10-01')
Out[61]: True

In [62]: np.array(['2020-06-10', '2020-07-10', '2020-08-10'], dtype='datetime64')
Out[62]: array(['2020-06-10', '2020-07-10', '2020-08-10'], dtype='datetime64[D]')

In [63]: np.array(['2020-06-10T12:00:00', '2020-07-10T12:00:00',
                   '2020-08-10T12:00:00'], dtype='datetime64[s]')
Out[63]: array(['2020-06-10T12:00:00', '2020-07-10T12:00:00',
                '2020-08-10T12:00:00'], dtype='datetime64[s]')


One can also generate ranges of dates by using the function np.arange(). Different 
frequencies (e.g., days, weeks, or seconds) are easily taken care of: 


In [64]: np.arange('2020-01-01', '2020-01-04', dtype='datetime64')  # (1)
Out[64]: array(['2020-01-01', '2020-01-02', '2020-01-03'], dtype='datetime64[D]')

In [65]: np.arange('2020-01-01', '2020-10-01', dtype='datetime64[M]')  # (2)
Out[65]: array(['2020-01', '2020-02', '2020-03', '2020-04', '2020-05',
                '2020-06', '2020-07', '2020-08', '2020-09'],
               dtype='datetime64[M]')

In [66]: np.arange('2020-01-01', '2020-10-01', dtype='datetime64[W]')[:10]  # (3)
Out[66]: array(['2019-12-26', '2020-01-02', '2020-01-09', '2020-01-16',
                '2020-01-23', '2020-01-30', '2020-02-06', '2020-02-13',
                '2020-02-20', '2020-02-27'], dtype='datetime64[W]')

In [67]: dtl = np.arange('2020-01-01T00:00:00', '2020-01-02T00:00:00',
                         dtype='datetime64[h]')  # (4)
         dtl[:10]
Out[67]: array(['2020-01-01T00', '2020-01-01T01', '2020-01-01T02',
                '2020-01-01T03', '2020-01-01T04', '2020-01-01T05', '2020-01-01T06',
                '2020-01-01T07', '2020-01-01T08', '2020-01-01T09'],
               dtype='datetime64[h]')

In [68]: np.arange('2020-01-01T00:00:00', '2020-01-02T00:00:00',
                   dtype='datetime64[s]')[:10]  # (5)
Out[68]: array(['2020-01-01T00:00:00', '2020-01-01T00:00:01',
                '2020-01-01T00:00:02', '2020-01-01T00:00:03',
                '2020-01-01T00:00:04', '2020-01-01T00:00:05',
                '2020-01-01T00:00:06', '2020-01-01T00:00:07',
                '2020-01-01T00:00:08', '2020-01-01T00:00:09'],
               dtype='datetime64[s]')

In [69]: np.arange('2020-01-01T00:00:00', '2020-01-02T00:00:00',
                   dtype='datetime64[ms]')[:10]  # (6)
Out[69]: array(['2020-01-01T00:00:00.000', '2020-01-01T00:00:00.001',
                '2020-01-01T00:00:00.002', '2020-01-01T00:00:00.003',
                '2020-01-01T00:00:00.004', '2020-01-01T00:00:00.005',
                '2020-01-01T00:00:00.006', '2020-01-01T00:00:00.007',
                '2020-01-01T00:00:00.008', '2020-01-01T00:00:00.009'],
               dtype='datetime64[ms]')

(1) Daily frequency.

(2) Monthly frequency.

(3) Weekly frequency.

(4) Hourly frequency.

(5) Second frequency.

(6) Millisecond frequency.
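Date ranges of this kind combine naturally with NumPy's business day functions. The following sketch counts calendar days and weekdays between the quote date and the expiry date of the DAX options from Chapter 21. It assumes the default week mask (Monday to Friday) and ignores public holidays, so the weekday count is an upper bound on the number of trading days.

```python
import numpy as np

start = np.datetime64('2018-04-27')  # quote date (a Friday)
end = np.datetime64('2018-07-20')  # expiry date

calendar_days = (end - start).astype(int)  # timedelta64[D] -> integer
business_days = np.busday_count(start, end)  # weekdays in [start, end)
dates = np.arange(start, end, dtype='datetime64[D]')  # daily frequency
weekdays_only = dates[np.is_busday(dates)]  # drops Saturdays and Sundays

print(calendar_days, business_days, len(weekdays_only))
```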


Plotting date-time and/or time series data can sometimes be tricky. matplotlib has 
support for standard datetime objects. Transforming NumPy datetime64 information 
into Python datetime information generally does the trick, as the following example, 
whose result is shown in Figure A-1, illustrates: 


In [70]: import matplotlib.pyplot as plt 
%matplotlib inline 


In [71]: np.random.seed(3000) 
rnd = np.random.standard_normal(len(dtl)).cumsum() ** 2 


In [72]: fig = plt.figure(figsize=(10, 6)) 
plt.plot(dtl.astype(dt.datetime), rnd) (1)
fig.autofmt_xdate(); (2)


(1) Uses the datetime information as x values.
(2) Autoformats the datetime ticks on the x-axis.



Figure A-1. Plot with datetime x-ticks autoformatted 
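
The conversion step used for the plot can be examined in isolation. The following is a minimal sketch, not part of the original session; the variable names are chosen for illustration only. It shows that the result of astype(dt.datetime) depends on the unit of the datetime64 array: hourly data yields datetime objects, daily data yields date objects, and nanosecond data yields plain integers, so converting to microseconds first is one way to obtain datetime objects in that case.

```python
import datetime as dt
import numpy as np

# Hourly datetime64 values convert to datetime.datetime objects.
hours = np.arange('2020-01-01T00', '2020-01-02T00', dtype='datetime64[h]')
as_datetime = hours.astype(dt.datetime)

# Daily datetime64 values convert to datetime.date objects instead.
days = np.arange('2020-01-01', '2020-01-04', dtype='datetime64[D]')
as_date = days.astype(dt.datetime)

# Nanosecond resolution cannot be represented by datetime.datetime,
# so the conversion falls back to integers (nanoseconds since the epoch).
nanos = hours.astype('datetime64[ns]')
as_int = nanos.astype(dt.datetime)

# Casting to microseconds first gives datetime.datetime objects again.
safe = nanos.astype('datetime64[us]').astype(dt.datetime)

print(type(as_datetime[0]), type(as_date[0]), type(as_int[0]), type(safe[0]))
```

Any of the object arrays holding datetime or date objects can be passed to plt.plot() as x values.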
pandas


The pandas package was designed, at least to some extent, with time series data in 
mind. Therefore, the package provides classes that are able to efficiently handle date 
and time information, like the DatetimeIndex class for time indices (see the docu- 
mentation at http://bit.ly/timeseries_doc). 


pandas introduces the Timestamp object as a further alternative to datetime and
datetime64 objects:


In [73]: import pandas as pd 

In [74]: ts = pd.Timestamp('2020-06-30') (1)
ts
Out[74]: Timestamp('2020-06-30 00:00:00')


In [75]: d = ts.to_pydatetime() (2)
d
Out[75]: datetime.datetime(2020, 6, 30, 0, 0)


In [76]: pd.Timestamp(d) (3)
Out[76]: Timestamp('2020-06-30 00:00:00') 


In [77]: pd.Timestamp(nd) (4) 
Out[77]: Timestamp('2020-10-01 00:00:00') 


(1) Timestamp object from str object.
(2) datetime object from Timestamp object.
(3) Timestamp from datetime object.
(4) Timestamp from datetime64 object.
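
The round trips between the three representations can be checked directly. The sketch below is illustrative only; it assumes nothing beyond pandas and NumPy being installed, and the variable names are not taken from the original session.

```python
import datetime as dt
import numpy as np
import pandas as pd

ts = pd.Timestamp('2020-06-30')      # Timestamp from a str object
d = ts.to_pydatetime()               # Timestamp -> datetime.datetime
nd = np.datetime64('2020-10-01')     # a NumPy datetime64 value

ts_from_d = pd.Timestamp(d)          # datetime.datetime -> Timestamp
ts_from_nd = pd.Timestamp(nd)        # datetime64 -> Timestamp
back_to_np = ts_from_nd.to_datetime64()  # Timestamp -> datetime64

# All representations compare equal when they denote the same instant.
print(d == dt.datetime(2020, 6, 30), ts_from_d == ts, back_to_np == nd)
```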


Another important class is the aforementioned DatetimeIndex class, which is a
collection of Timestamp objects with a number of helpful methods attached. A
DatetimeIndex object can be created with the pd.date_range() function, which is
rather flexible and powerful for constructing time indices (see Chapter 8 for more
details on this function). Typical conversions are possible:


In [78]: dti = pd.date_range('2020/01/01', freq='M', periods=12) (1) 
dti 
Out[78]: DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30', 
'2020-05-31', '2020-06-30', '2020-07-31', '2020-08-31', 
'2020-09-30', '2020-10-31', '2020-11-30', '2020-12-31'], 
dtype='datetime64[ns]', freq='M') 


In [79]: dti[6] 
Out[79]: Timestamp('2020-07-31 00:00:00', freq='M') 


In [80]: pdi = dti.to_pydatetime() (2)
pdi
Out[80]: array([datetime.datetime(2020, 1, 31, 0, 0),
datetime.datetime(2020, 2, 29, 0, 0),
datetime.datetime(2020, 3, 31, 0, 0),
datetime.datetime(2020, 4, 30, 0, 0),
datetime.datetime(2020, 5, 31, 0, 0),
datetime.datetime(2020, 6, 30, 0, 0),
datetime.datetime(2020, 7, 31, 0, 0),
datetime.datetime(2020, 8, 31, 0, 0),
datetime.datetime(2020, 9, 30, 0, 0),
datetime.datetime(2020, 10, 31, 0, 0),
datetime.datetime(2020, 11, 30, 0, 0),
datetime.datetime(2020, 12, 31, 0, 0)], dtype=object)


In [81]: pd.DatetimeIndex(pdi) (3)
Out[81]: DatetimeIndex(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30',
'2020-05-31', '2020-06-30', '2020-07-31', '2020-08-31',
'2020-09-30', '2020-10-31', '2020-11-30', '2020-12-31'],
dtype='datetime64[ns]', freq=None)


In [82]: pd.DatetimeIndex(dtl) (4)
Out[82]: DatetimeIndex(['2020-01-01 00:00:00', '2020-01-01 01:00:00',
'2020-01-01 02:00:00', '2020-01-01 03:00:00',
'2020-01-01 04:00:00', '2020-01-01 05:00:00',
'2020-01-01 06:00:00', '2020-01-01 07:00:00',
'2020-01-01 08:00:00', '2020-01-01 09:00:00',
'2020-01-01 10:00:00', '2020-01-01 11:00:00',
'2020-01-01 12:00:00', '2020-01-01 13:00:00',
'2020-01-01 14:00:00', '2020-01-01 15:00:00',
'2020-01-01 16:00:00', '2020-01-01 17:00:00',
'2020-01-01 18:00:00', '2020-01-01 19:00:00',
'2020-01-01 20:00:00', '2020-01-01 21:00:00',
'2020-01-01 22:00:00', '2020-01-01 23:00:00'],
dtype='datetime64[ns]', freq=None)


(1) DatetimeIndex object with monthly frequency for 12 periods.
(2) DatetimeIndex object converted to ndarray objects with datetime objects.
(3) DatetimeIndex object from ndarray object with datetime objects.
(4) DatetimeIndex object from ndarray object with datetime64 objects.
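
The conversions can be condensed into a short, self-contained sketch. It is an illustration under stated assumptions rather than the original session: the month-end frequency is passed as a pd.offsets.MonthEnd() object instead of the string alias 'M', because recent pandas releases flag that alias as deprecated, and the variable names from_py and from_np are hypothetical.

```python
import numpy as np
import pandas as pd

# Twelve month-end dates for 2020; MonthEnd() is used instead of the 'M' alias.
dti = pd.date_range('2020-01-01', periods=12, freq=pd.offsets.MonthEnd())

# DatetimeIndex -> ndarray of datetime.datetime objects, and back again.
pdi = dti.to_pydatetime()
from_py = pd.DatetimeIndex(pdi)

# DatetimeIndex built directly from a NumPy datetime64 array (hourly values).
hours = np.arange('2020-01-01T00', '2020-01-02T00', dtype='datetime64[h]')
from_np = pd.DatetimeIndex(hours)

print(dti[1], pdi.dtype, len(from_np))
```

Note that the frequency information is not carried over by the round trip through datetime objects, which matches the freq=None shown in the output above.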


pandas takes care of proper plotting of date-time information (see Figure A-2 and 
also Chapter 8): 


In [83]: rnd = np.random.standard_normal(len(dti)).cumsum() ** 2 


In [84]: df = pd.DataFrame(rnd, columns=['data'], index=dti) 
In [85]: df.plot(figsize=(10, 6));


Figure A-2. pandas plot with Timestamp x-ticks autoformatted 


pandas also integrates well with the pytz module to manage time zones: 


In [86]: pd.date_range('2020/01/01', freq='M', periods=12, 
tz=pytz.timezone('CET')) 
Out[86]: DatetimeIndex(['2020-01-31 00:00:00+01:00', '2020-02-29 00:00:00+01:00',
'2020-03-31 00:00:00+02:00', '2020-04-30 00:00:00+02:00',
'2020-05-31 00:00:00+02:00', '2020-06-30 00:00:00+02:00',
'2020-07-31 00:00:00+02:00', '2020-08-31 00:00:00+02:00',
'2020-09-30 00:00:00+02:00', '2020-10-31 00:00:00+01:00',
'2020-11-30 00:00:00+01:00', '2020-12-31 00:00:00+01:00'],
dtype='datetime64[ns, CET]', freq='M')


In [87]: dti = pd.date_range('2020/01/01', freq='M', periods=12, tz='US/Eastern')
dti
Out[87]: DatetimeIndex(['2020-01-31 00:00:00-05:00', '2020-02-29 00:00:00-05:00',
'2020-03-31 00:00:00-04:00', '2020-04-30 00:00:00-04:00',
'2020-05-31 00:00:00-04:00', '2020-06-30 00:00:00-04:00',
'2020-07-31 00:00:00-04:00', '2020-08-31 00:00:00-04:00',
'2020-09-30 00:00:00-04:00', '2020-10-31 00:00:00-04:00',
'2020-11-30 00:00:00-05:00', '2020-12-31 00:00:00-05:00'],
dtype='datetime64[ns, US/Eastern]', freq='M')


In [88]: dti.tz_convert('GMT') 
Out[88]: DatetimeIndex(['2020-01-31 05:00:00+00:00', '2020-02-29 05:00:00+00:00',
'2020-03-31 04:00:00+00:00', '2020-04-30 04:00:00+00:00',
'2020-05-31 04:00:00+00:00', '2020-06-30 04:00:00+00:00',
'2020-07-31 04:00:00+00:00', '2020-08-31 04:00:00+00:00',
'2020-09-30 04:00:00+00:00', '2020-10-31 04:00:00+00:00',
'2020-11-30 05:00:00+00:00', '2020-12-31 05:00:00+00:00'],
dtype='datetime64[ns, GMT]', freq='M')
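

The effect of daylight saving time on the UTC offsets can be verified with a few lines. This is a minimal sketch under two assumptions that differ from the session above: the IANA name 'America/New_York' is used in place of the 'US/Eastern' alias, and the frequency is given as a pd.offsets.MonthEnd() object rather than the string 'M'.

```python
import pandas as pd

# Month-end timestamps at midnight New York wall-clock time.
dti = pd.date_range('2020-01-01', periods=12, freq=pd.offsets.MonthEnd(),
                    tz='America/New_York')

# UTC offset in hours for every timestamp (-5 in winter, -4 during DST).
offsets = [ts.utcoffset().total_seconds() / 3600 for ts in dti]

# The same instants expressed in UTC.
utc = dti.tz_convert('UTC')

print(offsets)
print(utc[0], utc[6])
```

The end of October still carries the summer offset because daylight saving time in the United States ended on November 1 in 2020.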
APPENDIX B
BSM Option Class 


Class Definition 


The following presents a class definition for a European call option in the Black- 
Scholes-Merton (1973) model. The class-based implementation is an alternative to 
the one based on functions as presented in “Python Script” on page 392: 


#
# Valuation of European call options in Black-Scholes-Merton model
# incl. vega function and implied volatility estimation
# -- class-based implementation
#
# Python for Finance, 2nd ed.
# (c) Dr. Yves J. Hilpisch
#
from math import log, sqrt, exp
from scipy import stats


class bsm_call_option(object):
    ''' Class for European call options in BSM model.

    Attributes

    S0: float
        initial stock/index level
    K: float
        strike price
    T: float
        maturity (in year fractions)
    r: float
        constant risk-free short rate
    sigma: float
        volatility factor in diffusion term

    Methods

    value: float
        returns the present value of call option
    vega: float
        returns the vega of call option
    imp_vol: float
        returns the implied volatility given option quote
    '''

    def __init__(self, S0, K, T, r, sigma):
        self.S0 = float(S0)
        self.K = K
        self.T = T
        self.r = r
        self.sigma = sigma

    def value(self):
        ''' Returns option value.
        '''
        d1 = ((log(self.S0 / self.K) +
               (self.r + 0.5 * self.sigma ** 2) * self.T) /
              (self.sigma * sqrt(self.T)))
        d2 = ((log(self.S0 / self.K) +
               (self.r - 0.5 * self.sigma ** 2) * self.T) /
              (self.sigma * sqrt(self.T)))
        value = (self.S0 * stats.norm.cdf(d1, 0.0, 1.0) -
                 self.K * exp(-self.r * self.T) * stats.norm.cdf(d2, 0.0, 1.0))
        return value

    def vega(self):
        ''' Returns vega of option.
        '''
        d1 = ((log(self.S0 / self.K) +
               (self.r + 0.5 * self.sigma ** 2) * self.T) /
              (self.sigma * sqrt(self.T)))
        vega = self.S0 * stats.norm.pdf(d1, 0.0, 1.0) * sqrt(self.T)
        return vega

    def imp_vol(self, C0, sigma_est=0.2, it=100):
        ''' Returns implied volatility given option price.
        '''
        # Newton iteration on a helper option that starts at sigma_est
        option = bsm_call_option(self.S0, self.K, self.T, self.r, sigma_est)
        for i in range(it):
            option.sigma -= (option.value() - C0) / option.vega()
        return option.sigma
Class Usage


This class can be used in an interactive Jupyter Notebook session as follows: 


In [1]: from bsm_option_class import *


In [2]: o = bsm_call_option(100., 105., 1.0, 0.05, 0.2)
type(o)
Out[2]: bsm_option_class.bsm_call_option


In [3]: value = o.value()
value
Out[3]: 8.021352235143176


In [4]: o.vega()
Out[4]: 39.67052380842653


In [5]: o.imp_vol(C0=value)
Out[5]: 0.2

The option class can also be used to visualize, for example, the value and vega of the
option for different strikes and maturities. This is, after all, one of the major
advantages of having an analytical option pricing formula available. The following
Python code generates the option statistics for different maturity-strike combinations:


In [6]: import numpy as np
maturities = np.linspace(0.05, 2.0, 20)
strikes = np.linspace(80, 120, 20)
K, T = np.meshgrid(strikes, maturities)
C = np.zeros_like(K)
V = np.zeros_like(C)
# rows correspond to maturities, columns to strikes
for i, t in enumerate(maturities):
    for j, k in enumerate(strikes):
        o.T = t
        o.K = k
        C[i, j] = o.value()
        V[i, j] = o.vega()
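

The same grid can also be filled without explicit loops. The sketch below is an alternative under stated assumptions, not the approach of the class: the function bsm_call_grid is hypothetical, the normal CDF is built from math.erf wrapped in np.vectorize, and the fixed parameters (S0 = 100, r = 0.05, sigma = 0.2) mirror the option object used above.

```python
from math import erf, sqrt

import numpy as np

# Element-wise standard normal CDF built from math.erf (illustrative helper).
norm_cdf = np.vectorize(lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0))))


def bsm_call_grid(S0, K, T, r, sigma):
    # K and T are arrays of equal shape; value and vega are computed element-wise.
    d1 = (np.log(S0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    value = S0 * norm_cdf(d1) - K * np.exp(-r * T) * norm_cdf(d2)
    vega = S0 * np.exp(-0.5 * d1 ** 2) / np.sqrt(2.0 * np.pi) * np.sqrt(T)
    return value, vega


maturities = np.linspace(0.05, 2.0, 20)
strikes = np.linspace(80, 120, 20)
K, T = np.meshgrid(strikes, maturities)  # rows: maturities, columns: strikes
C, V = bsm_call_grid(100.0, K, T, 0.05, 0.2)
print(C.shape, V.shape)
```

The resulting arrays have the same layout as those produced by the nested loops, so they can be passed to the plotting code unchanged.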


First, a look at the option values. Figure B-1 presents the value surface for the
European call option:


In [7]: from pylab import cm, mpl, plt
from mpl_toolkits.mplot3d import Axes3D
mpl.rcParams['font.family'] = 'serif'
%matplotlib inline


In [8]: fig = plt.figure(figsize=(12, 7))
# add_subplot() replaces fig.gca(projection='3d'), which newer matplotlib rejects
ax = fig.add_subplot(111, projection='3d')
surf = ax.plot_surface(K, T, C, rstride=1, cstride=1,
cmap=cm.coolwarm, linewidth=0.5, antialiased=True)
ax.set_xlabel('strike')
ax.set_ylabel('maturity')
ax.set_zlabel('European call option value')
fig.colorbar(surf, shrink=0.5, aspect=5);


Figure B-1. Value surface for European call option

Second, a look at the vega values. Figure B-2 presents the vega surface for the
European call option:


In [9]: fig = plt.figure(figsize=(12, 7))
# add_subplot() replaces fig.gca(projection='3d'), which newer matplotlib rejects
ax = fig.add_subplot(111, projection='3d')
surf = ax.plot_surface(K, T, V, rstride=1, cstride=1,
cmap=cm.coolwarm, linewidth=0.5, antialiased=True)
ax.set_xlabel('strike')
ax.set_ylabel('maturity')
ax.set_zlabel('Vega of European call option')
fig.colorbar(surf, shrink=0.5, aspect=5);


Figure B-2. Vega surface for European call option
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Symbols 

% character, 71 

%time function, 276 

%timeit function, 276 

* (multiplication) operator, 150, 161 

+ (addition) operator, 150, 161 

2D plotting 
interactive, 195-203 
matplotlib import and customization, 168 
one-dimensional data sets, 169-176 
other plot styles, 183-191 
two-dimensional data sets, 176-183 

3D plotting, 191-194 

__abs__ method, 160 

__add__ method, 161 

__bool__ method, 160 

__getitem__ method, 161 

__init__ method, 155, 159 

__iter__ method, 162 

__len__ method, 161 

__mul_ method, 161 

__repr__ method, 160 

__sizeof__ method, 150 

{} (curly braces), 71 


A 


absolute differences, calculating, 212 
absolute price data, 442 

abstraction, 147 

acknowledgments, xviii 

adaptive quadrature, 336 

addition (+) operator, 150, 161 
aggregation, 148, 158 

Al-first finance, 28 


Index 


algorithmic trading 
automated trading, 521-554 
FXCM trading platform, 467-481 
trading strategies, 483-520 
algorithms (see also financial algorithms) 
Fibonacci numbers, 286-289 
for supervised learning, 448 
for unsupervised learning, 444 
prime numbers, 282-285 
the number pi, 290-293 
Amazon Web Services (AWS), 50 
American options, 376, 380, 607-614 
anonymous functions, 80 
antithetic paths, 573 
antithetic variates, 373 
append() method, 136 
appending, using pandas, 136 
apply() method, 142, 218 
approximation 
interpolation technique, 324-328 
main focus of, 312 
package imports and customizations, 312 
regression technique, 313-324 
arbitrary-precision floats, 65 
array module, 88 
arrays (see also NumPy) 
handling with pure Python code, 86-90 
T/O with PyTables, 262 
Python array class, 88-90 
writing and reading NumPy arrays, 242 
artificial intelligence (AI), 28 
Asian payoff, 606 
attributes, in object-oriented programming, 
145 
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attributions, xvi 
automated trading 
capital management, 522-532 
infrastructure and deployment, 546 
logging and monitoring, 547-549 
ML-based trading strategy, 532-543 
online algorithm, 544 
Python scripts, 550-554 
risk management, 547 
average_cy1() function, 280 
average_nb() function, 279 
average_np() function, 278 
average_py() function, 277 


B 


Bayesian statistics 
Bayesian regression, 430 
Bayes’ formula, 429 
concept of, 398 
real-world data application, 435 
updating estimates over time, 439 
Benevolent Dictator for Life, 5 
Bermudan exercise, 380, 607 
big data, 13, 231 
binomial trees 
Cox, Ross, and Rubinstein pricing model, 
294 
Cython implementation, 297 
Numba implementation, 297 
NumPy implementation, 295 
Python implementation, 294 
bit_length() method, 62 
Black-Scholes-Merton (BSM), 14, 299, 353, 356, 
369, 673-676 
Booleans, 66 
boxplots, 188 
Brownian motion, 299, 354, 356, 399, 491 
bsm_functions.py module, 378 


C 


call options, 375 
callback functions, 477 
candles data, 472 
capital asset pricing model, 398 
capital management 
Kelly criterion for stocks and indices, 
527-532 
Kelly criterion in binomial settings, 522-526 
capital market line, 425 


changes over time, calculating, 212-215 
charts and graphs (see data visualization) 
Chi square distribution, 351 
Cholesky decomposition, 365 
class attributes, 145 
classes 
building custom, 154-159 
in object-oriented programming, 145 
classification problems, 448, 504-511 
cloud instances 
basics of, 34 
benefits of, 56 
files required, 51 
installation script for Python and Jupyter 
Notebook, 53 
Jupyter Notebook configuration file, 52 
major tools used, 50 
RSA public and private keys, 51 
script to orchestrate Droplet setup, 55 
selecting appropriate hardware architecture, 
273 
service providers, 50 
code examples, obtaining and using, xvi 
coin tossing game, 522 
comparison operators, 66 
compilation 
dynamic compiling, 276, 279 
packages to speed up algorithms, 308 
static, 280 
complex selection, using pandas, 132-135 
composition, 148 
compressed tables, 260 
concatenation, using pandas, 135 
conda 
basic package management with, 37-41 
Miniconda installation, 35 
virtual environment management with, 
41-44 
constant short rate, 563 
constant volatility, 365 
constants, 565 
containers, 34 (see also Docker containers) 
contingent claims, valuation of, 375 
control structures, 78 
convex optimization 
constrained optimization, 332 
global minimum representation, 328 
global optimization, 329 
local optimization, 331 
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use cases for, 328 
correlation analysis 
data for, 222 
direct correlation measures, 227 
logarithmic returns, 224 
OLS regression, 226 
count() method, 76 
counter-based looping, 78 
covariance matrix, 416 
covariances, 398 
Cox, Ross, and Rubinstein pricing model, 294, 
359 
create_plot() function, 312 
create_ts() function, 269 
credit valuation adjustments (CVA), 388 
credit value-at-risk (CVaR), 388 
CSV files 
T/O with pandas, 250 
reading and writing with Python, 236 
cubic splines interpolation, 426 
Cufflinks library, 167, 195, 199 
cumsum() method, 171, 177, 215 
curly braces ({}), 71 
curves, 565 
Cython 
benefits of, 62, 281 
binomial trees using, 297 
exponentially weighted moving average 
(EWMA), 307 
looping in, 280 
Monte Carlo simulation using, 302 
prime number algorithm, 284 
recursive function implementations, 286 
special data type for larger numbers, 288 


D 


data visualization 
interactive 2D plotting, 195-203 
packages for, 167 
static 2D plotting, 168-191 
static 3D plotting, 191-194 
using pandas, 126 

Data-Driven Documents (D3.js) standard, 167, 
195 

data-driven finance, 24 

DataFrame class 
benefits of, 114 
major features of, 115 


working with DataFrame objects, 115-118, 
152 
working with ndarray objects, 119-123, 151, 
170 
DataFrame() function, 119 
date-time information (see also financial time 
series data) 
financial plots, 199-203 
managing with pandas, 119-123 
modeling and handling dates, 561 
NumPy functionality for handling, 665-667 
pandas functionality for handling, 668-670 
parsing with regular expressions, 74 
plotting, 667 
Python datetime module, 659-665 
datetime module, 659-665 
datetime64 information, 667 
DatetimeIndex objects, 120, 668 
date_range() function, 121 
DAX 30 stock index, 637 
decision trees (DTs), 452 
deep learning (DL), 28, 454 
deep neural networks (DNNs) 
benefits and drawbacks of, 454 
feature transforms, 457 
trading strategies and, 512-519 
train-test splits and, 459 
with scikit-learn, 454 
with TensorFlow, 455 
delta, 599 
derivatives analytics 
derivatives valuation, 595-616 
DX analytics package, 556, 617 
DX pricing library, 555 
market-based valuation, 637-657 
portfolio valuation, 617-636 
simulation of financial models, 571-592 
valuation framework, 557-569 
derivatives portfolios 
class to model, 622-626 
use cases for, 626-633 
derivatives positions 
class to model, 618 
use cases for, 620 
derivatives valuation 
American exercise, 607-614 
European exercise, 600-607 
generic valuation class, 596-599 
derivatives_portfolio class, 627, 634 
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derivatives_position class, 634 
describe() function, 123, 211 
deserialization, 233 
df.iplot() method, 196 
diachronic interpretation, 429 
dict objects, 81, 235 
diff() function, 213 
digitalization, 10 
DigitalOcean, 50 
dir function, 63 
discretization error, 356 
diversification, 416 
Docker containers 
basics of, 45 
benefits of, 50 
building an Ubuntu and Python Docker 
image, 46-50 
Docker images versus Docker containers, 45 
double-precision standard, 64 
downsampling, 215 
Droplets, 50, 55 
DST (Daylight Saving Time), 663 
dst() method, 663 
DX (Derivatives analytiX) pricing library, 555 
DX analytics package, 556, 617 
dx.constant_short_rate class, 564, 617 
dx.derivatives_portfolio, 626 
dx.geometric_brownian_motion class, 582, 
602, 617 
dx.jump_diffusion class, 583, 617 
dx.market_environment class, 565, 577, 617, 
621 
dx.square_root_diffusion class, 588, 617 
dx.valuation_class class, 599 
dx.valuation_mcs_american class, 611, 618 
dx.valuation_mcs_european class, 602, 618 
dx_frame.py module, 568 
dx_simulation.py, 591 
dynamic compiling, 276, 279 
dynamic simulation, 356 
dynamically typed languages, 62 


E 

early exercise premium, 382 

Editor, 50 

efficient frontier, 421, 424 

efficient markets hypothesis (EMH), 399, 492 
Eikon Data API, 25 

elif control element, 79 


else control element, 79 
encapsulation, 148, 156 
estimation of Greeks, 599 
estimation problems, 448 
Euler scheme, 357, 360, 583 
European options, 375, 600-607, 673-676 
eval() method, 142 
event-based backtesting, 537 
ewma_cy() function, 307 
ewma_nb() function, 307 
ewma_py() function, 306 
Excel files, I/O with pandas, 251 
.executemany() method, 246 
execution time, estimating for loops, 276 
expected portfolio return, 418 
expected portfolio variance, 418 
exponentially weighted moving average 
(EWMA) 
Cython implementation, 307 
equation for, 304 
Numba implementation, 307 
Python implementation, 305 


F 
fat tails, 385, 413 
feature transforms, 457 
Fibonacci numbers, 286-289 
fib_rec_pyl() function, 286 
filter() function, 80 
finance 
AI-first finance, 28 
data-driven, 24 
role of Python in, 14-24 
role of technology in, 9-14 
financial algorithms (see also algorithms; auto- 
mated trading; trading strategies) 
Black-Scholes-Merton (BSM), 14, 299, 353, 
356, 369, 673-676 
Cox, Ross, and Rubinstein pricing model, 
294, 359 
first-best versus best solutions, 308 
Least-Squares Monte Carlo (LSM), 381, 608 
online algorithm, 544 
simulation of financial models, 571-592 
support vector machine (SVM), 29, 460 
financial and data analytics 
challenges of, 13 
definition of, 13 
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selecting appropriate hardware architecture, 
273 
write once, retrieve multiple times, 267 
financial indicators, 217 
financial instruments 
custom modeling using Python classes, 
154-159 
symbols for (RICs), 209 
financial studies, 217 
financial theory, 398 
financial time series data 
changes over time, 212-215 
correlation analysis using pandas, 222-227 
data import using pandas, 206-209 
definition and examples of, 205 
high frequency data using pandas, 228 
package imports and customizations, 206 
recursive pandas algorithms for, 304-308 
resampling, 215 
rolling statistics using pandas, 217-222 
statistical analysis of real-world data, 
409-415 
summary statistics using pandas, 210-212 
tools for, 205 
find_MAP() function, 432 
first in, first out (FIFO) principle, 235 
first-best solution, 308 
fixed Gaussian quadrature, 336 
flash trading, 12 
floats, 63 
flow control, 68 
for loops, 78 
foresight bias, avoiding, 217 
format() function, 71 
frequency approach, 501-503 
frequency distribution, 631 
full truncation, 360 
functional programming, 80 
Fundamental Theorem of Asset Pricing, 
558-560 
FXCM trading platform 
getting started, 469 
retrieving prepackaged historical data 
candles data, 472 
historical market price data sets, 469 
tick data, 470 
risk disclaimer, 468 
working with the API 
account information, 480 


candles data, 475 
initial steps, 474 
placing orders, 478 
streaming data, 477 
fxcmpy package, 469 


G 

Gaussian mixture, 444, 447 

Gaussian Naive Bayes (GNB), 449, 504 
gbm_mcs_dyna() function, 377 
gbm_mcs_stat() function, 376 
generate_paths() method, 580 
generate_payoff() method, 600 
generate_time_grid() method, 574 
generic simulation class, 574-577 

generic valuation class, 596-599 
gen_paths() function, 399 

geometric Brownian motion, 356, 399, 577-582 
get_info() method, 619 
get_instrument_values() method, 575 
get_price() method, 156 
get_year_deltas() function, 562 

graphs and charts (see data visualization) 
Greeks, estimation of, 599 

Greenwich Mean Time (GMT), 662 
GroupBy operations, 130 


H 

hard disk drives (HDDs), 231 

HDF5 database standard, 252, 264 
Heston stochastic volatility model, 365 
hidden layers, 454 

high frequency data, 228 

histograms, 186, 225 

hit ratio, 500 

hybrid disk drives, 231 


| 

idioms and paradigms, 308 
IEEE 754, 64 

if control element, 79 
immutable objects, 76 
import this command, 4 
importing, definition of, 6 
index() method, 76 

info() function, 123, 211 
inheritance, 147 
input/output (I/O) operations 
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compatibility issues, 236 
role in financial analyses, 231 
with pandas 
from SQL to pandas, 247 
working with CSV files, 250 
working with Excel files, 251 
working with SQL databases, 245 
with PyTables 
out-of-memory computations, 264 
working with arrays, 262 
working with compressed tables, 260 
working with tables, 253 
with Python 
reading and writing text files, 236 
working with SQL databases, 239 
writing and reading NumPy arrays, 242 
writing objects to disk, 232 
with TsTables 
data retrieval, 270 
data storage, 269 
sample data, 267 
instance attributes, 145 
instantiation, in object-oriented programming, 
146 
integers, 62, 149 
integrated development environments (IDEs), 
6 
integration 
integration by simulation, 337 
integration interval, 335 
numerical integration, 336 
package imports and customizations, 334 
use cases for, 334 
interactive 2D plotting 
basic plots, 195-199 
financial plots, 199-203 
packages for, 195 
interpolation technique 
basic idea of, 324 
linear splines interpolation, 324 
potential drawbacks of, 328 
sci.splrep() and sci.splev() functions, 325 
[Python 
benefits and history of, 6 
exiting, 48 
GBM simulation class, 580 
installing, 39 
interactive data analytics and, 19 
tab completion capabilities, 62 


with Python 2.7 syntax, 42 
is_prime() function, 283, 285 
is_prime_cy2() function, 285 
is_prime_nb() function, 285 
iterative algorithms, 287 


J 
joining, using pandas, 137 
jump diffusion, 369, 582-586 
Jupyter 
downloading, xvi 
Jupyter Notebook 
basics of, 50 
configuration file, 52 
history of, 6 
installation script, 53 
security measures, 53 


K 
k-means clustering algorithm, 444, 446, 
499-501 
Kelly criterion 
for stocks and indices, 527-532 
in binomial settings, 522-526 
kernel density estimator (KDE), 225 
key-value stores, 81 
keyword module, 66 
kurtosis test, 405 


L 


lambda functions, 80 
LaTeX typesetting, 189, 339 
Least-Squares Monte Carlo (LSM), 381, 608 
least-squares regression, 321 
left join, 137 
leverage effect, 365 
linear regression, 314 
linear splines interpolation, 324 
list comprehensions, 79 
lists 
constructing arrays with, 86 
defining, 76 
expanding and reducing, 77 
looping over, 79 
in market environment, 565 
in object-oriented programming, 150 
operations and methods, 78 
LLVM (low level virtual machine), 279 
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log returns, calculating, 214, 224 
log-normal distribution, 354, 399 
logical operators, 67 
logistic regression (LR), 451, 504 
longest drawdown period, 540 
Longstaff-Schwartz model, 608 
loops 

Cython, 280 

estimating execution time, 276 

Numba, 279 

NumPy, 278 

Python, 277 
loss level, 388 


M 


machine learning (ML) 
adoption of in financial industry, 28 
basics of, 398 
packages for, 444 
supervised learning, 448-461 
types covered, 444 
unsupervised learning, 444-447 
map() function, 80 
market environments, 565, 574 
market-based valuation 
model calibration, 641-650 
options data, 638-640 
Python code for, 654 
Markov chain Monte Carlo (MCMC) sampling, 
432, 437 
Markov property, 356 
Markowitz, Harry, 397, 415 
martingale approach, 560 
martingale measure, 375, 558, 578 
mathematical tools 
adoption of applied mathematics in finan- 
cial industry, 311 
approximation, 312-328 
convex optimization, 328-334 
integration, 334-337 
mathematics and Python syntax, 18 
symbolic computation, 337-343 
matplotlib 
basics of, 8 
benefits of, 167 
boxplot generation using, 188 
date-time information, 667 
histogram generation using, 186, 225 
matplotlib gallery, 189 


NumPy data structures and, 171 
pandas wrapper around, 126 
scatter plot generation using, 184, 246 
static 2D plotting using, 168-191 
maximization of long-term wealth, 522 
maximization of the Sharpe ratio, 421 
maximum drawdown, 540 
McKinney, Wes, 205 
mcs_pi_py() function, 292 
mcs_simulation_cy() function, 302 
mcs_simulation_nb() function, 302 
mcs_simulation_np() function, 301 
mcs_simulation_py() function, 300 
mean return, 398 
mean() method, 129 
mean-reverting processes, 359 
mean-squared error (MSE), 646 
mean-variance portfolio selection, 420 
memory layout, 110 
memoryless process, 356 
merging, using pandas, 139 
methods, in object-oriented programming, 145 
Miniconda, 35 
minimization function, 421 
minimization of portfolio variance, 423 
minimize() function, 421 
min_func_sharpe() function, 423 
ML-based trading strategy 
optimal leverage, 538 
overview of, 532 
persisting model object, 543 
risk analysis, 539-543 
vectorized backtesting, 533-537 
MLPClassifier algorithm class, 454 
Modern Portfolio Theory (MPT), 415 (see also 
portfolio optimization) 
modularization, 147, 617 
moment matching, 374, 573 
Monte Carlo simulation, 14, 290, 299-304, 337, 
352; 375 
multiplication (*) operator, 150, 161 
multiprocessing module, 276, 285, 303 
mutable objects, 77 


N 


noisy data, 319 
nonredundancy, 148 
norm.pdf() function, 403 
normal distribution, 398 
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normal log returns, 399 
normality tests 
benchmark case, 399-409 
real-world data, 409-415 
role of in finance, 397, 398 
skewness, kurtosis, and normality, 405 
normality_tests() function, 405 
normalization, 214 
normalized price data, 442 
normaltest(), 405 
now() function, 662 
np.allclose() function, 234 
np.arange() function, 242, 666 
np.concatenate() function, 373 
np.dot() function, 419 
np.exp() function, 215 
np.lin space() function, 312 
np.meshgrid() function, 192 
np.polyfit(), 313, 325 
np.polyval(), 313, 325 
np.sum() function, 142 
npr.lognormal() function, 354 
npr.standard_normal() function, 354 
Numba 
binomial trees using, 297 
exponentially weighted moving average 
(EWMA), 307 
looping in, 279 
Monte Carlo simulation using, 302 
potential drawbacks of, 279 
prime number algorithm, 283 
numerical integration, 336 
NumPy 
basics of, 8, 85 
binomial trees using, 295 
data structures covered, 85 
date-time information, 665-667 
datetime64 information, 667 
handling arrays of data with Python, 86-90 
looping in, 278 
Monte Carlo simulation using, 301 
regular NumPy arrays 
Boolean arrays, 101 
built-in methods, 91 
mathematical operations, 92 
metainformation, 97 
multiple dimensions, 94 
NumPy dtype objects, 97 
numpy.ndarray class, 90, 151, 170 


reshaping and resizing, 98 
speed comparison, 103 
universal functions, 92 
structured NumPy arrays, 105 
universal functions applied to pandas, 126 
vectorization of code, 106-112 
writing and reading NumPy arrays, 242 
numpy.random subpackage, 346, 572 
NUTS() function, 432 


0 


object relational mappers, 239 

object-oriented programming (OOP) 
benefits and drawbacks of, 145 
dx.derivatives_portfolio class, 626 
example class implementation, 146 
features of, 147 
Python classes, 154-159 
Python data model, 159-163 
Python objects, 149-154 
terminology used in, 145 
Vector class, 163 

objects, in object-oriented programming, 145 

online algorithm, 544 

OpenSSL, 51 

optimal decision step, 609 

optimal fraction f *, 523 

optimal stopping problem, 380, 608 

option pricing theory, 399 

opts object, 422 

ordinary least-squares (OLS) regression, 226, 
494-498 

out-of-memory computations, 264 

overfitting, 491 


P 
package managers 
basics of, 34 
conda basic operations, 37-41 
Miniconda installation, 35 
pandas 
basic analytics, 123-126 
basic visualization, 126 
basics of, 8 
benefits of, 113 
calculating changes over time using, 
212-215 
complex selection, 132-135 
concatenation, 135 
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correlation analysis using, 222-227 
data formats supported, 244 
data structures covered, 113 
DataFrame class, 114-123, 152 
date-time information, 668-670 
development of, 205 
error tolerance of, 126 
GroupBy operations, 130 
handling high frequency data using, 228 
import-export functions and methods, 245 
importing financial data using, 206-209 
joining, 137 
merging, 139 
multiple options provided by, 143 
NumPy universal functions and, 126 
performance aspects, 141 
recursive function implementations, 
304-308 
rolling statistics using, 218 
Series class, 128 
summary statistics using, 210-212 
working with CSV files in, 250 
working with Excel files in, 251 
working with SQL databases in, 245 
paradigms and idioms, 308 
parallel processing, 285 
parallelization, 303, 308 
parameters, in object-oriented programming, 
146 
pcet_change() function, 213 
pd.concat() function, 136 
pd.date_range() function, 668 
pd.read_csv() function, 206, 245, 251 
percentage change, calculating, 213 
perfect foresight, 217 
performance Python 
algorithms, 281-293 
approaches to speed up tasks, 275, 308 
binomial trees, 294-298 
ensuring high performance, 21 
loops, 276-281 
Monte Carlo simulation, 299-304 
recursive pandas algorithms, 304-308 
supposed Python shortcomings, 275 
pi (π), 290 
pickle.dump() function, 233 
pickle.load() function, 233 
plot() method, 126, 129 
plotly 
basic plots, 195 
benefits of, 167, 195 
Getting Started with Plotly for Python 
guide, 195 
local or remote rendering, 195 
plotting types available, 198 
plot_option_stats() function, 605 
simulation 
dynamic simulation, 356 
The animal on the cover of Python for Finance is a Hispaniolan solenodon. The 
Hispaniolan solenodon (Solenodon paradoxus) is an endangered mammal that lives on 
the Caribbean island of Hispaniola, which comprises Haiti and the Dominican 
Republic. It’s particularly rare in Haiti and a bit more common in the Dominican 
Republic. 


Solenodons are known to eat arthropods, worms, snails, and reptiles. They also 
consume roots, fruit, and leaves on occasion. A solenodon weighs a pound or two and 
has a foot-long head and body plus a ten-inch tail, give or take. This ancient mammal 
looks somewhat like a big shrew. It’s quite furry, with reddish-brown coloring on top 
and lighter fur on its undersides, while its tail, legs, and prominent snout lack hair. 


It has a rather sedentary lifestyle and often stays out of sight. When it does come out, 
its movements tend to be awkward, and it sometimes trips when running. However, 
being a night creature, it has developed an acute sense of hearing, smell, and touch. 
Its own distinctive scent is said to be “goatlike.” 


It excretes toxic saliva from a groove in the second lower incisor and uses it to 
paralyze and attack its invertebrate prey. As such, it is one of few venomous mammals. 
Sometimes the venom is released when solenodons fight one another, and it can be 
fatal to the solenodon itself. Often, after initial conflict, they establish a dominance 
relationship and get along in the same living quarters. Families tend to live together 
for a long time. Apparently, it only drinks while bathing. 


Many of the animals on O’Reilly covers are endangered; all of them are important to 
the world. To learn more about how you can help, go to animals.oreilly.com. 